In [None]:
import numpy as np

#Pandas
Pandas is another module. Its main goal is data manipulation and analysis in Python. It provides new data structures and operations for working with data tables.

In [None]:
import pandas as pd

##Series
Series are one-dimensional data structures, like lists or numpy arrays. What sets a series apart is that the index of each element can be a label that we assign, similar to the keys of a dictionary. To create a series we call the Series() constructor with the values and, if we want, the custom index labels.

In [None]:
my_series=pd.Series(np.array([1,2,3,4,5]), index=['a', 'b', 'c', 'd', 'e'])
print(my_series)

series1=pd.Series(np.array([1, 2, 3, 4, 5]), index=['a', 'b', 'c', 'd', 'e'])
print(series1)

a_dictionary={'a':7,'b':8,'c':9}
series2=pd.Series(a_dictionary)
print(series2)

a    1
b    2
c    3
d    4
e    5
dtype: int64
a    1
b    2
c    3
d    4
e    5
dtype: int64
a    7
b    8
c    9
dtype: int64


We can access the elements of a series by label, by position, or by slice

In [None]:
print(series1['a'])
print(series1.iloc[3])  # explicit positional access; plain series1[3] is ambiguous when the index holds labels
print(series1[:2])

1
4
a    1
b    2
dtype: int64
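The difference between label-based and position-based access can be made explicit with the .loc and .iloc accessors. The sketch below uses a small hypothetical series (not one of the series defined above). Recent pandas versions appear to discourage using a bare integer as a position on a series with string labels, so .iloc is the safer choice for positions.

```python
import pandas as pd

# Hypothetical series, used only to contrast the two accessors.
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']          # label-based lookup
by_position = s.iloc[1]        # position-based lookup (0 is the first element)
label_slice = s.loc['a':'c']   # label slices include both endpoints
position_slice = s.iloc[0:2]   # positional slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Note that the label slice returns three elements while the positional slice returns two.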


In [None]:
print(series1.index)
print(series1.values)

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
[1 2 3 4 5]


Series are based on numpy arrays. For this reason, we can easily apply vectorized operations, just like we did with arrays in numpy. However, operations between two series are aligned by index label, so both series should have the same indices; any label that appears in only one of them produces NaN in the result.

In [None]:
result=series1+series2
print(result)

print(series1*5)
print(np.sqrt(series1))

a     8.0
b    10.0
c    12.0
d     NaN
e     NaN
dtype: float64
a     5
b    10
c    15
d    20
e    25
dtype: int64
a    1.000000
b    1.414214
c    1.732051
d    2.000000
e    2.236068
dtype: float64
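If the NaN values produced by mismatched labels are not wanted, one option is the add() method with a fill_value, which treats a missing label as that value instead. This is a minimal sketch that rebuilds two series with the same contents as series1 and series2; the names s1 and s2 are only illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series({'a': 7, 'b': 8, 'c': 9})

plain = s1 + s2                    # labels d and e exist only in s1, so they become NaN
filled = s1.add(s2, fill_value=0)  # a label missing from one series counts as 0

print(plain)
print(filled)
```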


#Dataframe 
Dataframes are the most used structure in pandas. A dataframe is a two-dimensional structure of labeled data: it represents a table in which every cell is identified by a row label and a column label. There are many ways to build dataframes, and they all use pandas's DataFrame() constructor.
For example, we can build a dataframe from a dictionary that stores two series:

In [None]:
a_dict={'col1': pd.Series([1,2,3], index=['a','b','c']),
       'col2': pd.Series([5,6,8], index=['a','b','d'])}
df_small=pd.DataFrame(a_dict)
df_small




Unnamed: 0,col1,col2
a,1.0,5.0
b,2.0,6.0
c,3.0,
d,,8.0


Another possible way to create a dataframe is through a list of dictionaries:

In [None]:
mylist=[{'a':1,'b':2,'c':5}]
df_small2=pd.DataFrame(mylist)
df_small2


Unnamed: 0,a,b,c
0,1,2,5


You can create a dataframe from a CSV file. Some CSV files include a header row and others don't, in which case you have to supply the column names yourself. By default, the read_csv function assumes the first row of the file is the header, so for a headerless file we pass header=None.

In [None]:
import pandas as pd
import requests


url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
url2 = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names"
df = pd.read_csv(url1, header=None)
df.columns=['age','sex','chest_pain_type',
            'rest_blood_pressure','serum_cholesterol',
            'fasting_blood_sugar','rest_ecg','max_HR_achieved',
            'exercise_induced_angina','ST_depression','ST_slope','vessels_fluoroscopy','thal','heart_disease']

df_info = requests.get(url2).text
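The same loading pattern can be tried without a network connection by reading from an in-memory string. The sketch below uses three made-up rows and hypothetical column names, not the real file. It also passes na_values='?', on the assumption that a '?' marks a missing entry, so that the affected column is read as numeric instead of as text.

```python
import io
import pandas as pd

# Made-up, headerless sample; the values and column names are illustrative only.
raw = "63.0,1.0,0.0\n67.0,1.0,3.0\n38.0,1.0,?\n"

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,                      # the first row is data, not a header
    names=['age', 'sex', 'vessels'],  # supply the column names ourselves
    na_values='?',                    # assumption: '?' marks a missing value
)

print(sample)
print(sample.dtypes)
```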



The dataset we downloaded is hosted at the UCI repository. To know more about it, we have also fetched the text file that describes in detail what information the dataset contains

In [None]:
print(df_info)

Publication Request: 
   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
   This file describes the contents of the heart-disease directory.

   This directory contains 4 databases concerning heart disease diagnosis.
   All attributes are numeric-valued.  The data was collected from the
   four following locations:

     1. Cleveland Clinic Foundation (cleveland.data)
     2. Hungarian Institute of Cardiology, Budapest (hungarian.data)
     3. V.A. Medical Center, Long Beach, CA (long-beach-va.data)
     4. University Hospital, Zurich, Switzerland (switzerland.data)

   Each database has the same instance format.  While the databases have 76
   raw attributes, only 14 of them are actually used.  Thus I've taken the
   liberty of making 2 copies of each database: one with all the attributes
   and 1 with the 14 attributes actually used in past experiments.

   The authors of the databases have requested:

      ...that any publications resulting from the use of th

We can select columns from the dataframe by their column header

In [None]:
df['rest_blood_pressure']

0      145.0
1      160.0
2      120.0
3      130.0
4      130.0
       ...  
298    110.0
299    144.0
300    130.0
301    130.0
302    138.0
Name: rest_blood_pressure, Length: 303, dtype: float64

And we can also select rows by their index

In [None]:
print(df.loc[302])
print(df_small.loc['a'])
print(df_small.iloc[1])

age                         38.0
sex                          1.0
chest_pain_type              3.0
rest_blood_pressure        138.0
serum_cholesterol          175.0
fasting_blood_sugar          0.0
rest_ecg                     0.0
max_HR_achieved            173.0
exercise_induced_angina      0.0
ST_depression                0.0
ST_slope                     1.0
vessels_fluoroscopy            ?
thal                         3.0
heart_disease                  0
Name: 302, dtype: object
col1    1.0
col2    5.0
Name: a, dtype: float64
col1    2.0
col2    6.0
Name: b, dtype: float64
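Both accessors also accept a second argument that selects columns, so a single cell or a block of the table can be picked in one step. A minimal sketch on a small hypothetical dataframe:

```python
import pandas as pd

# Hypothetical dataframe with labeled rows.
df_demo = pd.DataFrame({'col1': [1, 2, 3], 'col2': [5, 6, 8]},
                       index=['a', 'b', 'c'])

cell = df_demo.loc['b', 'col2']   # row label 'b', column label 'col2'
rows = df_demo.loc[['a', 'c']]    # a list of labels selects several rows
first_two = df_demo.iloc[:2, 0]   # first two rows of the first column, by position

print(cell)
print(rows)
print(first_two)
```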


We can add new columns that hold either a constant value or a Series. A Series is aligned with the dataframe by index, so rows whose index has no match in the Series are filled with NaN


In [None]:
df['new_col']=5
df['new_col2']=pd.Series([*range(300)])
df


Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,vessels_fluoroscopy,thal,heart_disease,new_col,new_col2
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0,5,0.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2,5,1.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,5,2.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,5,3.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,5,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1,5,298.0
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2,5,299.0
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3,5,
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1,5,
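The alignment rule is easier to see on a small table. In this sketch (hypothetical names and values), the assigned Series has no entry for row 'b', so that cell becomes NaN. Converting a Series to a plain array with to_numpy() bypasses the alignment and assigns by position, which requires the lengths to match.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])

df_demo['const'] = 5                                     # the same value in every row
df_demo['aligned'] = pd.Series([1, 2], index=['a', 'c'])  # matched by label; 'b' gets NaN

# A plain array carries no labels, so it is assigned by position.
other = pd.Series([7, 8, 9], index=['x', 'y', 'z'])
df_demo['by_position'] = other.to_numpy()

print(df_demo)
```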


We can also delete columns using their header (label)





In [None]:
del df['new_col']
df

Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,vessels_fluoroscopy,thal,heart_disease,new_col2
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0,0.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2,1.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,2.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,3.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1,298.0
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2,299.0
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3,
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1,
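The del statement modifies the dataframe in place. When the original should stay untouched, the drop() method returns a new dataframe without the given columns. A short sketch with a hypothetical table:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

trimmed = df_demo.drop(columns=['b'])  # new dataframe; df_demo still has column 'b'
del df_demo['c']                       # removes column 'c' from df_demo itself

print(list(trimmed.columns))
print(list(df_demo.columns))
```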


We can also apply functions to a subset of columns. If a function returns a boolean value for each row, we can use the result as a mask to filter the rows of the dataframe

In [None]:
filtro=np.greater_equal(df.ST_depression,df.ST_slope)
df[filtro]


Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,vessels_fluoroscopy,thal,heart_disease,new_col2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,2.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,3.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,4.0
6,62.0,0.0,4.0,140.0,268.0,0.0,2.0,160.0,0.0,3.6,3.0,2.0,3.0,3,6.0
9,53.0,1.0,4.0,140.0,203.0,1.0,2.0,155.0,1.0,3.1,3.0,0.0,7.0,1,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,58.0,1.0,4.0,114.0,318.0,0.0,1.0,140.0,0.0,4.4,3.0,3.0,6.0,4,285.0
286,58.0,0.0,4.0,170.0,225.0,1.0,2.0,146.0,1.0,2.8,2.0,2.0,6.0,2,286.0
291,55.0,0.0,2.0,132.0,342.0,0.0,0.0,166.0,0.0,1.2,1.0,0.0,3.0,0,291.0
293,63.0,1.0,4.0,140.0,187.0,0.0,2.0,144.0,1.0,4.0,1.0,2.0,7.0,2,293.0


In [None]:
filtro=np.less_equal(df.rest_blood_pressure,110)
df[filtro]

Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,vessels_fluoroscopy,thal,heart_disease,new_col2
16,48.0,1.0,2.0,110.0,229.0,0.0,0.0,168.0,0.0,1.0,3.0,0.0,7.0,1,16.0
20,64.0,1.0,1.0,110.0,211.0,0.0,2.0,144.0,1.0,1.8,2.0,0.0,3.0,0,20.0
29,40.0,1.0,4.0,110.0,167.0,0.0,2.0,114.0,1.0,2.0,2.0,0.0,7.0,3,29.0
46,51.0,1.0,3.0,110.0,175.0,0.0,0.0,123.0,0.0,0.6,1.0,0.0,3.0,0,46.0
50,41.0,0.0,2.0,105.0,198.0,0.0,0.0,168.0,0.0,0.0,1.0,1.0,3.0,0,50.0
57,41.0,1.0,4.0,110.0,172.0,0.0,2.0,158.0,0.0,0.0,1.0,0.0,7.0,1,57.0
73,65.0,1.0,4.0,110.0,248.0,0.0,2.0,158.0,0.0,0.6,1.0,2.0,6.0,1,73.0
74,44.0,1.0,4.0,110.0,197.0,0.0,2.0,177.0,0.0,0.0,1.0,1.0,3.0,1,74.0
80,45.0,1.0,4.0,104.0,208.0,0.0,2.0,148.0,1.0,3.0,2.0,0.0,3.0,0,80.0
93,44.0,0.0,3.0,108.0,141.0,0.0,0.0,175.0,0.0,0.6,2.0,0.0,3.0,0,93.0
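The comparison operators produce the same kind of boolean mask as np.less_equal and np.greater_equal, and masks can be combined with & (and) and | (or). Each condition needs its own parentheses because of operator precedence. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values; the column names are illustrative.
df_demo = pd.DataFrame({'pressure': [145, 110, 105, 130],
                        'age': [63, 48, 41, 57]})

low = df_demo[df_demo['pressure'] <= 110]
low_and_young = df_demo[(df_demo['pressure'] <= 110) & (df_demo['age'] < 45)]

print(low)
print(low_and_young)
```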


We can also transpose the columns and rows of a dataframe, for example to make it easier to view

In [None]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,293,294,295,296,297,298,299,300,301,302
age,63.0,67.0,67.0,37.0,41.0,56.0,62.0,57.0,63.0,53.0,...,63.0,63.0,41.0,59.0,57.0,45.0,68.0,57.0,57.0,38.0
sex,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0
chest_pain_type,1.0,4.0,4.0,3.0,2.0,2.0,4.0,4.0,4.0,4.0,...,4.0,4.0,2.0,4.0,4.0,1.0,4.0,4.0,2.0,3.0
rest_blood_pressure,145.0,160.0,120.0,130.0,130.0,120.0,140.0,120.0,130.0,140.0,...,140.0,124.0,120.0,164.0,140.0,110.0,144.0,130.0,130.0,138.0
serum_cholesterol,233.0,286.0,229.0,250.0,204.0,236.0,268.0,354.0,254.0,203.0,...,187.0,197.0,157.0,176.0,241.0,264.0,193.0,131.0,236.0,175.0
fasting_blood_sugar,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
rest_ecg,2.0,2.0,2.0,0.0,2.0,0.0,2.0,0.0,2.0,2.0,...,2.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,0.0
max_HR_achieved,150.0,108.0,129.0,187.0,172.0,178.0,160.0,163.0,147.0,155.0,...,144.0,136.0,182.0,90.0,123.0,132.0,141.0,115.0,174.0,173.0
exercise_induced_angina,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
ST_depression,2.3,1.5,2.6,3.5,1.4,0.8,3.6,0.6,1.4,3.1,...,4.0,0.0,0.0,1.0,0.2,1.2,3.4,1.2,0.0,0.0


If we want to sort the elements of the dataframe in a different order we can use sort_values or sort_index. In both cases, pandas assumes an ascending order by default, so we have to tell it if we don't want it that way

In [None]:
df.sort_index(ascending=False)

Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,vessels_fluoroscopy,thal,heart_disease,new_col2
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0,
301,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1,
300,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3,
299,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2,299.0
298,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1,298.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0,4.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0,3.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1,2.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2,1.0


In [None]:
df_small.sort_index(ascending=False)

Unnamed: 0,col1,col2
d,,8.0
c,3.0,
b,2.0,6.0
a,1.0,5.0
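The examples above sort by index. Sorting by the values of one or more columns works the same way with sort_values. A small sketch on a hypothetical table:

```python
import pandas as pd

df_demo = pd.DataFrame({'age': [63, 41, 57, 41],
                        'pressure': [145, 130, 130, 105]})

by_age_desc = df_demo.sort_values(by='age', ascending=False)
# With several columns, later columns break ties in the earlier ones.
by_age_then_pressure = df_demo.sort_values(by=['age', 'pressure'])

print(by_age_desc)
print(by_age_then_pressure)
```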


The groupby function lets us group the rows of the dataframe based on the values of a given column, or group of columns

In [None]:
df.groupby(by='heart_disease')['heart_disease'].count()

heart_disease
0    164
1     55
2     36
3     35
4     13
Name: heart_disease, dtype: int64

In [None]:
df.groupby(by=['heart_disease','sex'])['age'].count().reset_index(name='counts')


Unnamed: 0,heart_disease,sex,counts
0,0,0.0,72
1,0,1.0,92
2,1,0.0,9
3,1,1.0,46
4,2,0.0,7
5,2,1.0,29
6,3,0.0,7
7,3,1.0,28
8,4,0.0,2
9,4,1.0,11
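Counting is only one of the aggregations that can follow a groupby. The sketch below, on a made-up table, computes a mean per group and uses size() to count the rows in each combination of two columns.

```python
import pandas as pd

# Made-up data with illustrative column names.
df_demo = pd.DataFrame({'disease': [0, 0, 1, 1, 1],
                        'sex': [0, 1, 1, 1, 0],
                        'age': [40, 50, 60, 62, 58]})

mean_age = df_demo.groupby('disease')['age'].mean()
counts = df_demo.groupby(['disease', 'sex']).size().reset_index(name='counts')

print(mean_age)
print(counts)
```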


We can also apply functions (anonymous or otherwise) to all the elements of the dataframe, but we have to be careful to check if the data types are compatible with a given function

In [None]:
df.select_dtypes(include=['float64','int64']).apply(np.sqrt)

Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,heart_disease,new_col2
0,7.937254,1.0,1.000000,12.041595,15.264338,1.0,1.414214,12.247449,0.0,1.516575,1.732051,0.000000,0.000000
1,8.185353,1.0,2.000000,12.649111,16.911535,0.0,1.414214,10.392305,1.0,1.224745,1.414214,1.414214,1.000000
2,8.185353,1.0,2.000000,10.954451,15.132746,0.0,1.414214,11.357817,1.0,1.612452,1.414214,1.000000,1.414214
3,6.082763,1.0,1.732051,11.401754,15.811388,0.0,0.000000,13.674794,0.0,1.870829,1.732051,0.000000,1.732051
4,6.403124,0.0,1.414214,11.401754,14.282857,0.0,1.414214,13.114877,0.0,1.183216,1.000000,0.000000,2.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,6.708204,1.0,1.000000,10.488088,16.248077,0.0,0.000000,11.489125,0.0,1.095445,1.414214,1.000000,17.262677
299,8.246211,1.0,2.000000,12.000000,13.892444,1.0,0.000000,11.874342,0.0,1.843909,1.414214,1.414214,17.291616
300,7.549834,1.0,2.000000,11.401754,11.445523,0.0,0.000000,10.723805,1.0,1.095445,1.414214,1.732051,
301,7.549834,0.0,1.414214,11.401754,15.362291,0.0,1.414214,13.190906,0.0,0.000000,1.414214,1.000000,


In [None]:
df.select_dtypes(include=['float64','int64']).apply(lambda x:x*3)

Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,heart_disease,new_col2
0,189.0,3.0,3.0,435.0,699.0,3.0,6.0,450.0,0.0,6.9,9.0,0,0.0
1,201.0,3.0,12.0,480.0,858.0,0.0,6.0,324.0,3.0,4.5,6.0,6,3.0
2,201.0,3.0,12.0,360.0,687.0,0.0,6.0,387.0,3.0,7.8,6.0,3,6.0
3,111.0,3.0,9.0,390.0,750.0,0.0,0.0,561.0,0.0,10.5,9.0,0,9.0
4,123.0,0.0,6.0,390.0,612.0,0.0,6.0,516.0,0.0,4.2,3.0,0,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,135.0,3.0,3.0,330.0,792.0,0.0,0.0,396.0,0.0,3.6,6.0,3,894.0
299,204.0,3.0,12.0,432.0,579.0,3.0,0.0,423.0,0.0,10.2,6.0,6,897.0
300,171.0,3.0,12.0,390.0,393.0,0.0,0.0,345.0,3.0,3.6,6.0,9,
301,171.0,0.0,6.0,390.0,708.0,0.0,6.0,522.0,0.0,0.0,6.0,3,
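By default, apply passes each column to the function as a Series. With axis=1 it passes each row instead, which is useful when the function combines several columns. A minimal sketch with hypothetical data:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1.0, 4.0, 9.0], 'b': [16.0, 25.0, 36.0]})

# Column-wise: the lambda receives one column at a time.
col_range = df_demo.apply(lambda col: col.max() - col.min())
# Row-wise: the lambda receives one row at a time.
row_sum = df_demo.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_range)
print(row_sum)
```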


We can display summary statistics of a dataframe by using describe()

In [None]:
df.describe()

Unnamed: 0,age,sex,chest_pain_type,rest_blood_pressure,serum_cholesterol,fasting_blood_sugar,rest_ecg,max_HR_achieved,exercise_induced_angina,ST_depression,ST_slope,heart_disease,new_col2
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,300.0
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,1.60066,0.937294,149.5
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.616226,1.228536,86.746758
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,74.75
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,149.5
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,2.0,2.0,224.25
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,4.0,299.0


Or we can directly ask for some statistic value...

In [None]:
df.mean(numeric_only=True)  # average only the numeric columns



age                         54.438944
sex                          0.679868
chest_pain_type              3.158416
rest_blood_pressure        131.689769
serum_cholesterol          246.693069
fasting_blood_sugar          0.148515
rest_ecg                     0.990099
max_HR_achieved            149.607261
exercise_induced_angina      0.326733
ST_depression                1.039604
ST_slope                     1.600660
heart_disease                0.937294
new_col2                   149.500000
dtype: float64
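Columns stored as text are left out of these statistics. If such a column is really numeric apart from a few placeholder entries, one option is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN. This sketch uses made-up values and assumes that '?' stands for a missing entry.

```python
import pandas as pd

# Made-up values; the 'vessels' column is text because of the '?' entry.
df_demo = pd.DataFrame({'age': [38.0, 57.0, 63.0],
                        'vessels': ['?', '1.0', '0.0']})

means_before = df_demo.mean(numeric_only=True)  # only 'age' is numeric here

df_demo['vessels'] = pd.to_numeric(df_demo['vessels'], errors='coerce')
means_after = df_demo.mean()                    # NaN entries are skipped

print(means_before)
print(means_after)
```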

If we want to save the dataframe to a file, we can use the to_csv function

In [None]:
df2=df.select_dtypes(include=['float64','int64']).apply(lambda x:x%2)
df2.to_csv('df2.csv')
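By default, to_csv also writes the row index as a first column without a name, and reading such a file back yields an extra column called 'Unnamed: 0'. Passing index=False avoids this. The sketch below round-trips a small hypothetical dataframe through an in-memory string instead of a file.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Without a path, to_csv returns the CSV text instead of writing a file.
with_index = pd.read_csv(io.StringIO(df_demo.to_csv()))
without_index = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```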

#Introduction to data visualization


We can also import datasets directly from Excel files

In [None]:
df_hmcia= pd.read_excel('datahmcia.xlsx', sheet_name='data')

FileNotFoundError: ignored

In [None]:
df

In [None]:
df_hmcia

Let's look at the answer times recorded for each question (columns 21 to 40)

In [None]:
df_hmcia.iloc[:,21:41]

We can change these times so they represent time in seconds

In [None]:
# Convert every time-of-day cell in columns 21 to 40 into a number of seconds
for x in range(len(df_hmcia)):
  for y in range(21,41):
    t=df_hmcia.iloc[x,y]
    df_hmcia.iloc[x,y]=int(t.second + t.minute*60 + t.hour*3600)
display(df_hmcia.iloc[:,21:41])
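The same conversion can be written without explicit loops by mapping a helper function over each column. Since the Excel file is not available here, this sketch builds a small hypothetical table of datetime.time values; the column names TQ1 and TQ2 and the helper to_seconds are assumptions made for illustration.

```python
import datetime as dt
import pandas as pd

def to_seconds(t):
    """Convert a datetime.time value into a number of seconds."""
    return t.hour * 3600 + t.minute * 60 + t.second

# Hypothetical answer times for two students and two questions.
times = pd.DataFrame({
    'TQ1': [dt.time(0, 0, 12), dt.time(0, 1, 5)],
    'TQ2': [dt.time(0, 0, 30), dt.time(1, 0, 0)],
})

# apply visits each column, and map converts each cell of that column.
seconds = times.apply(lambda col: col.map(to_seconds))
print(seconds)
```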

## Matplotlib

This package lets us create different kinds of plots

In [None]:
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

We can draw line plots

In [None]:
y=df_hmcia.iloc[1,21:41]
plt.plot(y)
y2=df_hmcia.iloc[2,21:41]
plt.plot(y2)
y3=df_hmcia.iloc[3,21:41]
plt.plot(y3)

We can also change the markers and colors

In [None]:
plt.plot(y,'or')
plt.plot(y2,'xb')
plt.plot(y3,'.y')

In [None]:
x=np.array([*range(1,21)])
plt.plot(x,y,'--',linewidth=1)
print(y.shape)

Some plots need to be resized, and it's useful to include a legend

In [None]:
plt.rcParams['figure.figsize'] = [20, 5]
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.plot(x,y,'--',linewidth=3,label='s1')
plt.legend(bbox_to_anchor =(0,1.1))
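Changing plt.rcParams affects every figure created afterwards. To size a single figure, an alternative is to create it with plt.subplots(figsize=...) and draw on the returned axes. The sketch below uses randomly generated values in place of the answer times, so the data and the labels s1 and s2 are purely illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 21)
rng = np.random.default_rng(0)      # fixed seed so the made-up data is reproducible
y1 = rng.integers(5, 30, size=20)
y2 = rng.integers(5, 30, size=20)

fig, ax = plt.subplots(figsize=(20, 5))   # size applies to this figure only
ax.plot(x, y1, '--', linewidth=3, label='s1')
ax.plot(x, y2, 'o-', label='s2')
ax.set_xticks(x)                          # one tick per question
ax.legend()
```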

Histograms let us see the frequency of certain values

In [None]:
plt.hist(df_hmcia.E10)
matplotlib.rcParams.update(matplotlib.rcParamsDefault)
plt.xticks(np.arange(min(df_hmcia.E10), max(df_hmcia.E10)+1, 1.0))

In [None]:
plt.hist(df_hmcia.TQ10)

Sometimes, we want to take into account several columns at the same time

In [None]:
question_times=np.array(df_hmcia.iloc[:,21:41])
question_times=question_times.flatten()
plt.hist(question_times, range=(0,30),color='red')
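The hist function also returns the counts and the bin edges it computed, which helps to check what the plot is showing. By default it uses 10 bins; the bins argument changes that. In this sketch with made-up values, the range argument drops the value 31 because it falls outside the interval from 0 to 30.

```python
import matplotlib.pyplot as plt
import numpy as np

values = np.array([2, 5, 12, 14, 18, 25, 29, 31])  # made-up times in seconds

fig, ax = plt.subplots()
# Three bins of width 10 covering 0 to 30; values outside the range are ignored.
counts, edges, _ = ax.hist(values, bins=3, range=(0, 30), color='red')

print(counts)
print(edges)
```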

Bar charts are useful to compare several subjects with regard to one variable

In [None]:
plt.rcParams['figure.figsize'] = [15, 5]  # set the size before the figure is created
nota=5*(np.sum(df_hmcia.iloc[:,1:21],axis=1)/20)  # mean of columns 1 to 20, scaled by 5
labels=df_hmcia.Estudiante
plt.xticks(range(len(nota)), labels)
plt.yticks([0,1,2,3,4,5])
plt.xlabel('Student')
plt.ylabel('Quiz grade')
plt.title('Student grades on the quiz')
plt.bar(range(len(nota)), nota)
plt.show()


In [None]:
nota=5*(np.sum(df_hmcia.iloc[:,1:21],axis=1)/20)
print(nota)

In [None]:
plt.rcParams['figure.figsize'] = [15, 5]  # set the size before the figure is created
autoconfianza=5*np.mean(df_hmcia.iloc[:,41:65],axis=1)/2  # mean of columns 41 to 64, scaled by 5/2
labels=df_hmcia.Estudiante
plt.xticks(range(len(autoconfianza)), labels)
plt.yticks([0,1,2,3,4,5])
plt.xlabel('Student')
plt.ylabel('Self-confidence')
plt.title('Student self-confidence in Python topics')
plt.bar(range(len(autoconfianza)), autoconfianza, color='red')
plt.show()

And we can also compare several variables 

In [None]:
#plt.bar(x_axis -0.2, female, width=0.4, label = 'Female')
#plt.bar(x_axis +0.2, male, width=0.4, label = 'Male')

width =0.3
plt.bar(np.arange(len(nota)), nota, width=width, label='nota') 
plt.bar(np.arange(len(autoconfianza))+width, autoconfianza, color='red', width=width, label='autoconfianza') 
labels=df_hmcia.Estudiante
plt.xticks(range(len(autoconfianza)), labels)
plt.legend()
plt.show()
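In a grouped bar chart the tick labels look best when each pair of bars is centered on its tick. One way to do this is to shift one group left and the other right by half the bar width. The sketch below uses three hypothetical students with made-up grade and confidence values.

```python
import matplotlib.pyplot as plt
import numpy as np

students = ['s1', 's2', 's3']            # hypothetical labels
grade = np.array([3.5, 4.0, 2.5])        # made-up values on a 0-5 scale
confidence = np.array([4.0, 3.0, 3.5])

positions = np.arange(len(students))
width = 0.3

fig, ax = plt.subplots(figsize=(8, 4))
# Shift each group by half a bar width so the pair is centered on the tick.
ax.bar(positions - width / 2, grade, width=width, label='grade')
ax.bar(positions + width / 2, confidence, width=width, color='red', label='confidence')
ax.set_xticks(positions)
ax.set_xticklabels(students)
ax.legend()
```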

