# Python for Data Analysis


## Introduction to Python Course - Tecnun, Universidad de Navarra

Fernando Carazo


Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/)

In [22]:
#Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Pandas is a Python package that works mostly with:
- **Series** (1D labeled homogeneous array)
- **DataFrame** (2D labeled heterogeneous table)
- **Panel** (general 3D array; removed from recent pandas releases, so it is not covered here)

### Pandas Series

A Pandas *Series* is a one-dimensional labeled array holding data of a single type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the *index*.

In [24]:
# Example of creating Pandas series :
s1 = pd.Series([1,2,3,4,5,6,7])

We did not pass an index, so pandas assigned a default one ranging from 0 to len(data)-1.

In [26]:
# View index values
s1.index

RangeIndex(start=0, stop=7, step=1)

In [28]:
# Creating Pandas series with index:
s1 = pd.Series([1,2,3], index=["a","b","c"])

In [30]:
# View index values
s1.index

Index(['a', 'b', 'c'], dtype='object')
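As a small sketch of how a labeled index is used (the values and labels below are illustrative, not taken from the course data), elements of a Series can be retrieved by label or by position:

```python
import pandas as pd

# Illustrative Series with an explicit string index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]         # label-based access
by_position = s.iloc[1]   # position-based access (second element)
subset = s[["a", "c"]]    # several labels at once; returns a Series

print(by_label, by_position)
print(subset)
```

Using `.iloc` for positions keeps the intent explicit, which matters when the index itself contains integers.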

### Pandas DataFrame

A Pandas *DataFrame* is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns (axes). It can be thought of as a dictionary-like container of Series objects, one per column.

In [32]:
dic = {"Name":["Alice", "Bob", "Chris"], "Age" : [21,25,23]}
print(dic)

{'Name': ['Alice', 'Bob', 'Chris'], 'Age': [21, 25, 23]}


In [34]:
d = pd.DataFrame(dic)
d

Unnamed: 0,Name,Age
0,Alice,21
1,Bob,25
2,Chris,23


In [36]:
#Add a new column:
d["Height"] = [5.2,6.0,5.6]
d

Unnamed: 0,Name,Age,Height
0,Alice,21,5.2
1,Bob,25,6.0
2,Chris,23,5.6


In [38]:
#Read csv file
df = pd.read_csv("./data/Salaries.csv")

In [40]:
#Display a few first records
df.head(10)

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
1,Prof,A,12,6,Male,93000
2,Prof,A,23,20,Male,110515
3,Prof,A,40,31,Male,131205
4,Prof,B,20,18,Male,104800
5,Prof,A,20,20,Male,122400
6,AssocProf,A,20,17,Male,81285
7,Prof,A,18,18,Male,126300
8,Prof,A,29,19,Male,94350
9,Prof,A,51,51,Male,57800


![image.png](img/dfAtributeS.PNG)

In [42]:
#List the data type of each column of df
df.dtypes

rank          object
discipline    object
phd            int64
service        int64
sex           object
salary         int64
dtype: object

In [44]:
#Check the type of a column "salary"
df["salary"].dtype


dtype('int64')

In [46]:
#List the types of all columns


In [48]:
#List the column names
df.columns

Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')

In [50]:
#List the row labels
df.index

RangeIndex(start=0, stop=78, step=1)

In [52]:
#List the row labels and the column names together
df.axes

[RangeIndex(start=0, stop=78, step=1),
 Index(['rank', 'discipline', 'phd', 'service', 'sex', 'salary'], dtype='object')]

In [54]:
#Number of dimensions
df.ndim

2

In [56]:
#Total number of elements in the Data Frame
df.size

468

In [58]:
#Number of rows and columns
df.shape

(78, 6)

![image.png](img/dfMethods.PNG)

In [60]:
#Output basic statistics for the numeric columns
df.describe()


Unnamed: 0,phd,service,salary
count,78.0,78.0,78.0
mean,19.705128,15.051282,108023.782051
std,12.498425,12.139768,28293.661022
min,1.0,0.0,57800.0
25%,10.25,5.25,88612.5
50%,18.5,14.5,104671.0
75%,27.75,20.75,126774.75
max,56.0,51.0,186960.0


In [62]:
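#Calculate mean for all numeric columns
#(numeric_only=True skips the text columns, which newer pandas versions would otherwise reject)
df.mean(numeric_only=True)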

phd            19.705128
service        15.051282
salary     108023.782051
dtype: float64
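The behaviour of `mean()` on a frame that mixes text and numeric columns depends on the pandas version, so it can help to restrict the computation to numeric columns explicitly. A minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not the Salaries dataset):

```python
import pandas as pd

# Hypothetical miniature of the salaries table
toy = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof"],
    "phd": [20, 5, 30],
    "salary": [120000, 80000, 130000],
})

# Option 1: ask mean() to ignore non-numeric columns
means = toy.mean(numeric_only=True)

# Option 2: select the numeric columns first, then aggregate
also = toy.select_dtypes("number").mean()

print(means)
```

Both options give the same result; the second one generalizes to any aggregation method.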

---
*Exercise* 

In [64]:
#Display first 20 records
# <your code goes here>
df.head(20)

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
1,Prof,A,12,6,Male,93000
2,Prof,A,23,20,Male,110515
3,Prof,A,40,31,Male,131205
4,Prof,B,20,18,Male,104800
5,Prof,A,20,20,Male,122400
6,AssocProf,A,20,17,Male,81285
7,Prof,A,18,18,Male,126300
8,Prof,A,29,19,Male,94350
9,Prof,A,51,51,Male,57800


In [66]:
#Find how many records this data frame has;
# <your code goes here>
len(df)

78

In [68]:
#Display the last 5 records
# <your code goes here>
df.tail(5)

Unnamed: 0,rank,discipline,phd,service,sex,salary
73,Prof,B,18,10,Female,105450
74,AssocProf,B,19,6,Female,104542
75,Prof,B,17,17,Female,124312
76,Prof,A,28,14,Female,109954
77,Prof,A,23,15,Female,109646


In [70]:
#Calculate the standard deviation (std() method) for all numeric columns
# <your code goes here>
df.std(numeric_only=True)

phd           12.498425
service       12.139768
salary     28293.661022
dtype: float64

In [72]:
#Calculate average of the columns in the first 50 rows
# <your code goes here>
print(df[:50].mean(numeric_only=True))

phd            21.52
service        17.60
salary     113789.14
dtype: float64


---
### Data slicing and grouping

In [74]:
#Extract a column by name (method 1)
print(df["sex"])
#or, alternatively, as an attribute (method 2)
df.sex

0       Male
1       Male
2       Male
3       Male
4       Male
       ...  
73    Female
74    Female
75    Female
76    Female
77    Female
Name: sex, Length: 78, dtype: object


0       Male
1       Male
2       Male
3       Male
4       Male
       ...  
73    Female
74    Female
75    Female
76    Female
77    Female
Name: sex, Length: 78, dtype: object

---

In [76]:
#List the distinct values of rank (these are the groups that groupby will form)
#Important
set(df["rank"])

{'AssocProf', 'AsstProf', 'Prof'}

In [78]:
#Calculate mean of all numeric columns for the grouped object
df.groupby(["rank"]).mean(numeric_only=True)

rank,phd,service,salary
AssocProf,15.076923,11.307692,91786.230769
AsstProf,5.052632,2.210526,81362.789474
Prof,27.065217,21.413043,123624.804348


In [80]:
#Calculate mean of all numeric columns grouped by sex
df.groupby(["sex"]).mean(numeric_only=True)

sex,phd,service,salary
Female,16.512821,11.564103,101002.410256
Male,22.897436,18.538462,115045.153846


In [82]:
#Calculate the mean salary for men and women. Single brackets around salary produce a Pandas Series
df.groupby(["sex"])["salary"].mean()

sex
Female    101002.410256
Male      115045.153846
Name: salary, dtype: float64

In [84]:
# If we use double brackets Pandas will produce a DataFrame
df.groupby(["sex"])[["salary"]].mean()

sex,salary
Female,101002.410256
Male,115045.153846


In [86]:
# Group using 2 variables - sex and rank:
df.groupby(["rank","sex"])[["salary"]].mean()

rank,sex,salary
AssocProf,Female,88512.8
AssocProf,Male,102697.666667
AsstProf,Female,78049.909091
AsstProf,Male,85918.0
Prof,Female,121967.611111
Prof,Male,124690.142857


---
*Exercise* 

In [88]:
#Calculate the basic statistics for the salary column (use the describe() method)
# <your code goes here>

In [90]:
#Calculate how many values there are in the salary column (use the count() method)
# <your code goes here>


In [92]:
# Group data by the discipline and find the average salary for each group
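The three exercise steps above can be tried on a small stand-in table first. This is only a sketch: the `toy` frame and its numbers are invented for illustration, and the same calls would be applied to `df` in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the salaries table
toy = pd.DataFrame({
    "discipline": ["A", "B", "A", "B"],
    "salary": [90000, 110000, 100000, 130000],
})

summary = toy["salary"].describe()   # count, mean, std, min, quartiles, max
n_values = toy["salary"].count()     # number of non-missing values

# Single brackets give a Series, double brackets give a DataFrame
as_series = toy.groupby("discipline")["salary"].mean()
as_frame = toy.groupby("discipline")[["salary"]].mean()

print(summary)
print(as_series)
```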

---
### Filtering

In [143]:
#Select observations whose salary is greater than 120K
df[df["salary"] > 120000]

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
3,Prof,A,40,31,Male,131205
5,Prof,A,20,20,Male,122400
7,Prof,A,18,18,Male,126300
10,Prof,B,39,33,Male,128250
11,Prof,B,23,23,Male,134778
13,Prof,B,35,33,Male,162200
14,Prof,B,25,19,Male,153750
15,Prof,B,17,3,Male,150480
19,Prof,A,29,27,Male,150500


In [140]:
#Select the rank and discipline columns for observations with salary > 20K
df.loc[df["salary"] > 20000, ["rank", "discipline"]]

Unnamed: 0,rank,discipline
0,Prof,B
1,Prof,A
2,Prof,A
3,Prof,A
4,Prof,B
...,...,...
73,Prof,B
74,AssocProf,B
75,Prof,B
76,Prof,A


In [96]:
#Select data for female professors
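One way to combine two filter conditions is to build a boolean mask with `&`, wrapping each comparison in parentheses. The sketch below uses an invented four-row frame (`toy` is an assumption, not the course data):

```python
import pandas as pd

# Hypothetical miniature of the salaries table
toy = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "sex": ["Female", "Female", "Male", "Female"],
    "salary": [120000, 80000, 130000, 95000],
})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (toy["sex"] == "Female") & (toy["rank"] == "Prof")
female_profs = toy[mask]

print(female_profs)
```

The parentheses are required because `&` binds more tightly than `==` in Python.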


---
### More on slicing the dataset

In [98]:
#Select column salary


In [100]:
#Check data type of the result


In [102]:
#Look at the first few elements of the output


In [104]:
#Select column salary and make the output to be a data frame


In [106]:
#Check the type
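The difference between single and double brackets when selecting a column can be checked directly. A minimal sketch on an invented two-row frame (the `toy` data is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [120000, 80000]})

as_series = toy["salary"]     # single brackets: a Series
as_frame = toy[["salary"]]    # double brackets: a one-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```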


In [108]:
#Select a subset of rows (based on their position):
# Note 1: The location of the first row is 0
# Note 2: The last value in the range is not included
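The two notes above can be illustrated with a plain row slice on a frame that has the default index. The `toy` frame is made up for this sketch:

```python
import pandas as pd

toy = pd.DataFrame({"salary": [100, 200, 300, 400, 500]})

# Rows at positions 1, 2 and 3; position 4 (the end of the range) is excluded
middle = toy[1:4]

print(middle)
```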


In [110]:
#If we want to select both rows and columns we can use method .loc


Unnamed: 0,salary
1,93000
2,110515
3,131205
4,104800
5,122400
6,81285
7,126300
8,94350
9,57800
10,128250


In [112]:
#Let's see what we get for our df_sub data frame
# Method .loc subsets the data frame based on the labels:


In [145]:
# Unlike method .loc, method .iloc selects rows (and columns) by position:
df.iloc[0:5, 0:3] #lets us select using two positional indices (rows, columns)

Unnamed: 0,rank,discipline,phd
0,Prof,B,56
1,Prof,A,12
2,Prof,A,23
3,Prof,A,40
4,Prof,B,20
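The contrast between `.loc` and `.iloc` is easiest to see when the row labels do not start at 0. The following sketch assumes an invented frame whose index runs from 10 to 13:

```python
import pandas as pd

# Hypothetical frame with row labels 10..13 instead of 0..3
toy = pd.DataFrame(
    {
        "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
        "salary": [120000, 80000, 130000, 95000],
    },
    index=[10, 11, 12, 13],
)

# .loc works with labels, and the end label is included
by_label = toy.loc[10:12, ["salary"]]

# .iloc works with positions, and the end position is excluded
by_position = toy.iloc[0:2, 0:1]

print(by_label)
print(by_position)
```

Here `by_label` has three rows (labels 10, 11, 12) while `by_position` has two rows and only the first column.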


### Sorting the Data

In [150]:
#Sort by phd (descending), then by discipline (ascending); this returns a new data frame
df.sort_values(["phd", "discipline"], ascending=[False, True])

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
9,Prof,A,51,51,Male,57800
27,Prof,A,45,43,Male,155865
36,Prof,B,45,45,Male,146856
3,Prof,A,40,31,Male,131205
...,...,...,...,...,...,...
57,AsstProf,A,3,1,Female,72500
60,AsstProf,B,3,3,Female,92000
23,AsstProf,A,2,0,Male,85000
55,AsstProf,A,2,0,Female,72500


In [153]:
#Sort by discipline (descending), then by phd (ascending), and overwrite the original dataset
df.sort_values(["discipline","phd"], ascending=[False, True], inplace=True) #inplace=True modifies the original df
df

Unnamed: 0,rank,discipline,phd,service,sex,salary
12,AsstProf,B,1,0,Male,88000
60,AsstProf,B,3,3,Female,92000
17,AsstProf,B,4,0,Male,92000
20,AsstProf,B,4,4,Male,92000
38,AsstProf,B,4,3,Male,91000
...,...,...,...,...,...,...
26,Prof,A,38,19,Male,148750
40,Prof,A,39,36,Female,137000
3,Prof,A,40,31,Male,131205
27,Prof,A,45,43,Male,155865


In [155]:
# Restore the original order (by sorting using index)
df.sort_index(inplace=True)
df

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
1,Prof,A,12,6,Male,93000
2,Prof,A,23,20,Male,110515
3,Prof,A,40,31,Male,131205
4,Prof,B,20,18,Male,104800
...,...,...,...,...,...,...
73,Prof,B,18,10,Female,105450
74,AssocProf,B,19,6,Female,104542
75,Prof,B,17,17,Female,124312
76,Prof,A,28,14,Female,109954
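The sort-then-restore pattern above can be sketched without `inplace=True` by keeping the sorted result in a new variable. The `toy` frame is an invented example:

```python
import pandas as pd

toy = pd.DataFrame({
    "discipline": ["B", "A", "B", "A"],
    "phd": [20, 12, 5, 40],
})

# discipline descending, then phd ascending within each discipline
sorted_copy = toy.sort_values(["discipline", "phd"], ascending=[False, True])

# Sorting by the index brings the rows back to their original order
restored = sorted_copy.sort_index()

print(sorted_copy)
```

Because `sort_values` returns a new frame by default, `toy` itself is left untouched.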


---
*Exercise* 

In [159]:
# Using filtering, find the mean values (including salary) for discipline A
df[df["discipline"] == "A"].mean(numeric_only=True)

phd           21.527778
service       15.722222
salary     98331.111111
dtype: float64

In [166]:
# Challenge:
# Extract (filter) only the professors with a high salary ( > 100K) and count how many are female and how many are male
df[(df["salary"] > 100000) & (df["rank"] == "Prof")].groupby("sex")["rank"].count()

sex
Female    16
Male      23
Name: rank, dtype: int64

In [167]:
# Sort data frame by the salary (in descending order) and display the first few records of the output (head)
df.sort_values("salary", ascending=False).head()

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
13,Prof,B,35,33,Male,162200
72,Prof,B,24,15,Female,161101
27,Prof,A,45,43,Male,155865
31,Prof,B,22,21,Male,155750


---

In [168]:
#Sort the data frame using 2 or more columns:
df.sort_values(["rank", "discipline"], ascending=[False, True])

Unnamed: 0,rank,discipline,phd,service,sex,salary
1,Prof,A,12,6,Male,93000
2,Prof,A,23,20,Male,110515
3,Prof,A,40,31,Male,131205
5,Prof,A,20,20,Male,122400
7,Prof,A,18,18,Male,126300
...,...,...,...,...,...,...
59,AssocProf,B,12,10,Female,103994
61,AssocProf,B,13,10,Female,103750
62,AssocProf,B,14,7,Female,109650
71,AssocProf,B,12,9,Female,71065


### Missing Values

In [180]:
# Read a dataset with missing values: data/flights.csv
flights = pd.read_csv("./data/flights.csv")
flights.head()

Unnamed: 0,year,month,day,dep_time,dep_delay,arr_time,arr_delay,carrier,tailnum,flight,origin,dest,air_time,distance,hour,minute
0,2013,1,1,517.0,2.0,830.0,11.0,UA,N14228,1545,EWR,IAH,227.0,1400,5.0,17.0
1,2013,1,1,533.0,4.0,850.0,20.0,UA,N24211,1714,LGA,IAH,227.0,1416,5.0,33.0
2,2013,1,1,542.0,2.0,923.0,33.0,AA,N619AA,1141,JFK,MIA,160.0,1089,5.0,42.0
3,2013,1,1,554.0,-6.0,812.0,-25.0,DL,N668DN,461,LGA,ATL,116.0,762,5.0,54.0
4,2013,1,1,554.0,-4.0,740.0,12.0,UA,N39463,1696,EWR,ORD,150.0,719,5.0,54.0



>**Hands-on exercises**
>
>- Find how many records this data frame has;
>
>- How many elements are there?
>
>- What are the column names?
>
>- What types of columns we have in this data frame?

In [181]:
#Find how many records this data frame has
flights.shape

(160754, 16)

In [182]:
#How many elements are there
flights.size

2572064

![](https://predictivehacks.com/wp-content/uploads/2020/08/numpy_arrays-1024x572.png)

In [125]:
df_aux = pd.DataFrame({"A": [1,2,3], "B": [1,2,3]})


In [126]:
# Select the rows that have at least one missing value


In [127]:
# Filter all the rows where arr_delay value is missing:



In [128]:
# Remove all the observations with missing values


In [129]:
# Fill missing values with zeros
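The four missing-value operations listed in the cells above can be sketched on a tiny invented frame. The column names mirror the flights data, but the values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -6.0, 4.0],
    "arr_delay": [11.0, np.nan, np.nan, 20.0],
})

# Rows that have at least one missing value
rows_with_nan = toy[toy.isnull().any(axis=1)]

# Rows where arr_delay specifically is missing
missing_arr = toy[toy["arr_delay"].isnull()]

# Drop every row that contains a missing value
complete = toy.dropna()

# Replace missing values with zeros
filled = toy.fillna(0)

print(rows_with_nan)
print(filled)
```

Both `dropna()` and `fillna()` return new frames here, so `toy` keeps its missing values.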


---
*Exercise* 

In [130]:
# Count how many missing data are in dep_delay and arr_delay columns
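A possible approach to this exercise is to chain `isnull()` with `sum()`, since `True` counts as 1. The sketch below uses invented values rather than the flights file:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -6.0, 4.0],
    "arr_delay": [11.0, np.nan, np.nan, 20.0],
})

# One count of missing values per selected column
missing_counts = toy[["dep_delay", "arr_delay"]].isnull().sum()

print(missing_counts)
```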


---
### Common Aggregation Functions:

|Function|Description
|-------|--------
|min   | minimum
|max   | maximum
|count   | number of non-null observations
|sum   | sum of values
|mean  | arithmetic mean of values
|median | median
|mad | mean absolute deviation (removed in pandas 2.0)
|mode | mode
|prod   | product of values
|std  | standard deviation
|var | unbiased variance



In [131]:
# Find the number of non-missing values in each column


In [132]:
# Find min value for all the columns in the dataset


In [133]:
# Let's compute summary statistics per group:


In [134]:
# We can use the agg() method for aggregation:


In [135]:
# An example of computing different statistics for different columns
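The aggregation steps in the cells above can be sketched as follows. The frame is invented (column names borrowed from the flights data, values assumed), and passing a dictionary to `agg()` is one common way to request different statistics per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "carrier": ["UA", "UA", "AA", "AA"],
    "dep_delay": [2.0, 4.0, np.nan, -6.0],
    "distance": [1400, 1416, 1089, 762],
})

non_missing = toy.count()                              # non-missing values per column
minimums = toy[["dep_delay", "distance"]].min()        # minimum of each numeric column
per_group = toy.groupby("carrier")["dep_delay"].mean() # one statistic per group

# Different statistics for different columns
stats = toy.groupby("carrier").agg({
    "dep_delay": ["min", "max"],
    "distance": "mean",
})

print(stats)
```

The result of the dictionary form has two-level column labels such as `("dep_delay", "min")`.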


### Basic descriptive statistics

|Function|Description
|-------|--------
|min   | minimum
|max   | maximum
|mean  | arithmetic mean of values
|median | median
|mad | mean absolute deviation (removed in pandas 2.0)
|mode | mode
|std  | standard deviation
|var | unbiased variance
|sem | standard error of the mean
|skew| sample skewness
|kurt|kurtosis
|quantile| value at a given quantile (e.g. 0.5 for the median)


In [136]:
# The convenient describe() function computes a variety of statistics
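A short sketch of `describe()` next to a few of the individual statistics from the table, using a made-up Series of five values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

summary = s.describe()       # count, mean, std, min, 25%, 50%, 75%, max
first_quartile = s.quantile(0.25)
std_error = s.sem()          # standard error of the mean
skewness = s.skew()          # close to 0 for symmetric data

print(summary)
```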


In [137]:
# Find the index of the maximum or minimum value
# If several values match, idxmin() and idxmax() return the first match
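The first-match behaviour can be seen on an invented column where the maximum appears twice:

```python
import pandas as pd

toy = pd.DataFrame({"dep_delay": [5.0, 30.0, -6.0, 30.0]})

label_of_max = toy["dep_delay"].idxmax()   # 30.0 occurs at labels 1 and 3
label_of_min = toy["dep_delay"].idxmin()

print(label_of_max, label_of_min)
```

The returned value is an index label, so it can be passed to `.loc` to retrieve the full row.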


In [138]:
# Count the number of records for each different value in a vector
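One way to count records per distinct value is `value_counts()`. A minimal sketch on an invented Series of carrier codes:

```python
import pandas as pd

carriers = pd.Series(["UA", "AA", "UA", "DL", "UA"])

# Counts per distinct value, most frequent first
counts = carriers.value_counts()

print(counts)
```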
