## Facilitator

Ali Haider - Technical Lead - GDSC Iowa State University

### Links

- LinkedIn: https://www.linkedin.com/in/m-ali-haider/
- Email: mhaider@iastate.edu
- GitHub: https://github.com/allihaider
- Medium: https://medium.com/@m.allihaider


---

## Topics
1. Series 
  * Creation
  * Attributes
  * Indexing
2. DataFrame
  * Creation
  * Attributes
  * Indexing
3. Operations
  * Adding and Removing columns
  * Renaming Columns and Indices
  * Summary Functions
  * Maps
  * Groupby
  * Sorting
  * Handling Data Types and Missing Values
  * Combining Dataframes


In [2]:
import pandas as pd

## Pandas Data Structures


### Series


A one-dimensional, array-like object containing a sequence of values and an associated sequence of labels, called its index.

### Creation

In [3]:
series_object = pd.Series([5, 6, 7, 8, 9])

In [4]:
series_object

0    5
1    6
2    7
3    8
4    9
dtype: int64

### Attributes

In [5]:
series_object.values

array([5, 6, 7, 8, 9])

In [6]:
series_object.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
# series_object.name
# series_object.index.name
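
The two commented-out attributes above are both `None` until they are set. A minimal sketch of setting them, where the names `"scores"` and `"position"` are made up for illustration:

```python
import pandas as pd

series_object = pd.Series([5, 6, 7, 8, 9])

# Both names are None by default.
print(series_object.name)        # None
print(series_object.index.name)  # None

# Illustrative names, not part of the original notebook.
series_object.name = "scores"
series_object.index.name = "position"
print(series_object)
```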

### Indexing

In [8]:
another_series_object = pd.Series([5, 6, 7, 8, 9], index=["a", "b", "c", "d", "e"])

In [9]:
another_series_object

a    5
b    6
c    7
d    8
e    9
dtype: int64

In [10]:
another_series_object["b"]

6

In [11]:
another_series_object[["a", "d", "e"]]

a    5
d    8
e    9
dtype: int64

### DataFrame

A table of data containing an ordered collection of columns.

OR

A dictionary of Series that all share the same index.
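
The "dictionary of Series" view can be tried directly. This is a small sketch with made-up Series (`classes`, `schools`) whose indices deliberately do not match, to show how the rows are aligned on labels:

```python
import pandas as pd

# Two hypothetical Series; "classes" has no entry for label "c".
classes = pd.Series(["AI", "Stats"], index=["a", "b"])
schools = pd.Series(["CS", "Maths", "Engineering"], index=["a", "b", "c"])

# Each Series becomes a column; rows are aligned on the index labels,
# and a label missing from one Series is filled with NaN.
frame = pd.DataFrame({"Classes": classes, "Schools": schools})
print(frame)
```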

### Creation

In [12]:
data_dict = {
    "Classes" : ["AI", "Stats", "Thermodynamics"],
    "Schools" : ["CS", "Maths", "Engineering"],
    "Seniority": ["Junior", "Senior", "Sophomore"]    
    }

In [13]:
df = pd.DataFrame(data_dict)

In [14]:
# index assigned automatically
df

Unnamed: 0,Classes,Schools,Seniority
0,AI,CS,Junior
1,Stats,Maths,Senior
2,Thermodynamics,Engineering,Sophomore


### Attributes

In [15]:
df.index

RangeIndex(start=0, stop=3, step=1)

In [16]:
df.columns

Index(['Classes', 'Schools', 'Seniority'], dtype='object')

### Indexing

In [17]:
df["Classes"]

0                AI
1             Stats
2    Thermodynamics
Name: Classes, dtype: object

In [18]:
df["Classes"][2]

'Thermodynamics'

In [19]:
df.Classes

0                AI
1             Stats
2    Thermodynamics
Name: Classes, dtype: object

In [20]:
df.loc[2]

Classes      Thermodynamics
Schools         Engineering
Seniority         Sophomore
Name: 2, dtype: object

In [21]:
data_dict = {
    "Classes" : ["AI", "Stats", "Thermodynamics"],
    "Schools": ["CS", "Maths", "Engineering"],
    "Seniority": ["Junior", "Senior", "Sophomore"]    
    }

In [22]:
df = pd.DataFrame(data_dict, index=["a", "b", "c"])

In [23]:
df

Unnamed: 0,Classes,Schools,Seniority
a,AI,CS,Junior
b,Stats,Maths,Senior
c,Thermodynamics,Engineering,Sophomore


In [24]:
df.loc["b"]

Classes       Stats
Schools       Maths
Seniority    Senior
Name: b, dtype: object

In [27]:
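# df.loc[1]  # raises KeyError: the index labels are now "a", "b", "c"; use df.iloc[1] for the row at position 1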
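
The cell above is left commented out because `loc` looks rows up by label, and `1` is no longer a label. A short sketch of what happens, using a cut-down copy of the frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Classes": ["AI", "Stats", "Thermodynamics"]},
    index=["a", "b", "c"],
)

try:
    df.loc[1]  # 1 is not one of the labels "a", "b", "c"
except KeyError as err:
    print("KeyError:", err)

# Position-based lookup with iloc still works.
print(df.iloc[1])
```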

In [28]:
df.iloc[1]

Classes       Stats
Schools       Maths
Seniority    Senior
Name: b, dtype: object

In [29]:
df.iloc[:, 0]

a                AI
b             Stats
c    Thermodynamics
Name: Classes, dtype: object

In [30]:
df[[True, False, True]]

Unnamed: 0,Classes,Schools,Seniority
a,AI,CS,Junior
c,Thermodynamics,Engineering,Sophomore
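
In practice the list of booleans is rarely typed by hand; it usually comes from a comparison on a column. A minimal sketch on a cut-down copy of the frame (the condition is chosen only to reproduce the same mask):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Classes": ["AI", "Stats", "Thermodynamics"],
        "Seniority": ["Junior", "Senior", "Sophomore"],
    },
    index=["a", "b", "c"],
)

# The comparison yields a boolean Series, one value per row.
mask = df["Seniority"] != "Senior"
print(mask.tolist())  # [True, False, True]

# Rows where the mask is True are kept.
print(df[mask])
```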


## Operations on Dataframes

### Adding and Removing columns

In [32]:
df["Grade"] = ["A", None, "A-"]

In [33]:
df

Unnamed: 0,Classes,Schools,Seniority,Grade
a,AI,CS,Junior,A
b,Stats,Maths,Senior,
c,Thermodynamics,Engineering,Sophomore,A-


In [34]:
del df["Grade"]

In [35]:
df

Unnamed: 0,Classes,Schools,Seniority
a,AI,CS,Junior
b,Stats,Maths,Senior
c,Thermodynamics,Engineering,Sophomore
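
`del` changes the frame in place. If the original should be kept, `drop` is one alternative, since it returns a new DataFrame. A small sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"Classes": ["AI", "Stats"], "Grade": ["A", None]})

# drop returns a new DataFrame; df itself keeps the "Grade" column.
without_grade = df.drop(columns=["Grade"])
print(without_grade.columns.tolist())  # ['Classes']
print(df.columns.tolist())             # ['Classes', 'Grade']
```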


### Renaming columns and indices

In [36]:
df.rename(columns={'Classes': 'Subjects'})

Unnamed: 0,Subjects,Schools,Seniority
a,AI,CS,Junior
b,Stats,Maths,Senior
c,Thermodynamics,Engineering,Sophomore


In [37]:
df.rename(index={"a": 0, "b": 1, "c": 2})

Unnamed: 0,Classes,Schools,Seniority
0,AI,CS,Junior
1,Stats,Maths,Senior
2,Thermodynamics,Engineering,Sophomore
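
Note that `rename` returns a new DataFrame, so the result has to be assigned to a name to be kept. A minimal sketch (the variable `renamed` is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Classes": ["AI", "Stats"]}, index=["a", "b"])

# Columns and index labels can be renamed in one call.
renamed = df.rename(columns={"Classes": "Subjects"}, index={"a": 0, "b": 1})

print(df.columns.tolist())       # ['Classes']  (original unchanged)
print(renamed.columns.tolist())  # ['Subjects']
print(renamed.index.tolist())    # [0, 1]
```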


### Summary Functions

Link to dataset: https://www.kaggle.com/datasets/rajyellow46/wine-quality


In [38]:
wine_df = pd.read_csv("winequalityN.csv")

In [39]:
wine_df

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,white,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,red,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6493,red,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,,11.2,6
6494,red,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,red,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [40]:
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  6497 non-null   object 
 1   fixed acidity         6487 non-null   float64
 2   volatile acidity      6489 non-null   float64
 3   citric acid           6494 non-null   float64
 4   residual sugar        6495 non-null   float64
 5   chlorides             6495 non-null   float64
 6   free sulfur dioxide   6497 non-null   float64
 7   total sulfur dioxide  6497 non-null   float64
 8   density               6497 non-null   float64
 9   pH                    6488 non-null   float64
 10  sulphates             6493 non-null   float64
 11  alcohol               6497 non-null   float64
 12  quality               6497 non-null   int64  
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


In [41]:
wine_df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,6487.0,6489.0,6494.0,6495.0,6495.0,6497.0,6497.0,6497.0,6488.0,6493.0,6497.0,6497.0
mean,7.216579,0.339691,0.318722,5.444326,0.056042,30.525319,115.744574,0.994697,3.218395,0.531215,10.491801,5.818378
std,1.29675,0.164649,0.145265,4.758125,0.035036,17.7494,56.521855,0.002999,0.160748,0.148814,1.192712,0.873255
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0


In [42]:
wine_df["type"].describe()

count      6497
unique        2
top       white
freq       4898
Name: type, dtype: object

In [43]:
wine_df.type.unique()

array(['white', 'red'], dtype=object)
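
`describe()` only reports the frequency of the most common value. To see the count for every distinct value, `value_counts()` can be used. A sketch on a toy Series standing in for the `type` column:

```python
import pandas as pd

# Toy stand-in for wine_df["type"].
toy_types = pd.Series(["white", "red", "white", "white"], name="type")

# One entry per distinct value, most frequent first.
counts = toy_types.value_counts()
print(counts)
```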

In [44]:
wine_df["volatile acidity"].mean()

0.33969101556480197

### Maps

In [45]:
acidity_mean = wine_df["volatile acidity"].mean()
wine_df["volatile acidity"].map(lambda p: p - acidity_mean)  # the function is applied to each value in turn

0      -0.069691
1      -0.039691
2      -0.059691
3      -0.109691
4      -0.109691
          ...   
6492    0.260309
6493    0.210309
6494    0.170309
6495    0.305309
6496   -0.029691
Name: volatile acidity, Length: 6497, dtype: float64

In [46]:
def threshold_points(row):
    # Subtract the column mean from this row's "volatile acidity" value.
    row["volatile acidity"] = row["volatile acidity"] - acidity_mean
    return row

In [47]:
wine_df.apply(threshold_points, axis=1)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,-0.069691,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,white,6.3,-0.039691,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,white,8.1,-0.059691,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,white,7.2,-0.109691,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,white,7.2,-0.109691,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,red,6.2,0.260309,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6493,red,5.9,0.210309,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,,11.2,6
6494,red,6.3,0.170309,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,red,5.9,0.305309,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [48]:
# Both map and apply return new objects (map a Series, apply here a DataFrame); neither modifies the original
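
A quick way to see this is to check the original column after calling `map`. The sketch below uses a three-row toy frame instead of the wine data, and also shows that plain Series arithmetic gives the same result without a lambda:

```python
import pandas as pd

# Toy data standing in for the wine dataset.
toy_df = pd.DataFrame({"volatile acidity": [0.2, 0.4, 0.6]})
acidity_mean = toy_df["volatile acidity"].mean()

centered = toy_df["volatile acidity"].map(lambda p: p - acidity_mean)

# The original column is untouched until the result is assigned back.
print(toy_df["volatile acidity"].tolist())  # [0.2, 0.4, 0.6]

# Subtracting a scalar from a Series works element by element as well.
vectorized = toy_df["volatile acidity"] - acidity_mean
print((centered - vectorized).abs().max())
```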

### Groupby

In [49]:
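# numeric_only=True skips the text column "type"; pandas 2.0 and later no longer drop non-numeric columns silently
wine_df.groupby('quality').mean(numeric_only=True)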

quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
3,7.853333,0.517,0.281,5.14,0.077033,39.216667,122.033333,0.995744,3.257667,0.506333,10.215
4,7.288889,0.457963,0.272315,4.153704,0.060158,20.636574,103.43287,0.994833,3.23162,0.504884,10.180093
5,7.329348,0.389768,0.307721,5.804116,0.064666,30.237371,120.839102,0.995849,3.212042,0.526416,9.837783
6,7.178037,0.313731,0.323786,5.551182,0.054168,31.165021,115.41079,0.994558,3.217701,0.532466,10.587553
7,7.128962,0.288895,0.334764,4.733952,0.045272,30.42215,108.49861,0.993126,3.22779,0.547025,11.386006
8,6.838542,0.29101,0.332539,5.382902,0.041124,34.533679,117.518135,0.992514,3.223212,0.512487,11.678756
9,7.42,0.298,0.386,4.12,0.0274,33.4,116.0,0.99146,3.308,0.466,12.18


In [50]:
# numeric_only=True keeps the text column "type" out of the sums
wine_df.groupby('quality').sum(numeric_only=True)

quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
3,235.6,15.51,8.43,154.2,2.311,1176.5,3661.0,29.87232,97.73,15.19,306.45
4,1574.4,98.92,58.82,897.2,12.934,4457.5,22341.5,214.88385,698.03,108.55,2198.9
5,15633.5,832.545,657.6,12409.2,138.256,64647.5,258354.0,2129.125135,6857.71,1124.95,21033.18
6,20328.2,888.485,917.61,15737.6,153.566,88384.0,327305.0,2820.567455,9112.53,1509.01,30026.3
7,7692.15,311.14,361.21,5103.2,48.849,32825.5,117070.0,1071.58286,3476.33,590.24,12285.5
8,1313.0,56.165,64.18,1038.9,7.937,6665.0,22681.0,191.55511,622.08,98.91,2254.0
9,37.1,1.49,1.93,20.6,0.137,167.0,580.0,4.9573,16.54,2.33,60.9
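
Several aggregations can also be computed in one pass with `agg`. A minimal sketch on a five-row toy frame (the values are invented, only the column names match the wine data):

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "quality": [5, 5, 6, 6, 6],
        "alcohol": [9.0, 10.0, 10.0, 11.0, 12.0],
    }
)

# One row per quality value, one column per aggregation.
summary = toy_df.groupby("quality")["alcohol"].agg(["count", "mean"])
print(summary)
```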


### Sorting

In [51]:
wine_df.sort_values(by='fixed acidity')

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
4259,white,3.8,0.310,0.02,11.1,0.036,20.0,114.0,0.99248,3.75,0.44,12.4,6
4787,white,3.9,0.225,0.40,4.2,0.030,29.0,118.0,0.98900,3.57,0.36,12.8,8
3265,white,4.2,0.215,0.23,5.1,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3
2872,white,4.2,0.170,0.36,1.8,0.029,93.0,161.0,0.98999,3.65,0.89,12.0,7
4847,white,4.4,0.540,0.09,5.1,0.038,52.0,97.0,0.99022,3.41,0.40,12.2,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
518,white,,0.130,0.28,1.9,0.050,20.0,78.0,0.99180,3.43,0.64,10.8,6
1079,white,,,0.29,6.2,0.046,29.0,227.0,0.99520,3.29,0.53,10.1,6
2902,white,,0.360,0.14,8.9,0.036,38.0,155.0,0.99622,3.27,,9.4,5
6428,red,,0.440,0.09,2.2,0.063,9.0,18.0,0.99444,,0.69,11.3,6


In [52]:
sorted_wine_df = wine_df.sort_values(by='fixed acidity')

In [53]:
sorted_wine_df

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
4259,white,3.8,0.310,0.02,11.1,0.036,20.0,114.0,0.99248,3.75,0.44,12.4,6
4787,white,3.9,0.225,0.40,4.2,0.030,29.0,118.0,0.98900,3.57,0.36,12.8,8
3265,white,4.2,0.215,0.23,5.1,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3
2872,white,4.2,0.170,0.36,1.8,0.029,93.0,161.0,0.98999,3.65,0.89,12.0,7
4847,white,4.4,0.540,0.09,5.1,0.038,52.0,97.0,0.99022,3.41,0.40,12.2,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
518,white,,0.130,0.28,1.9,0.050,20.0,78.0,0.99180,3.43,0.64,10.8,6
1079,white,,,0.29,6.2,0.046,29.0,227.0,0.99520,3.29,0.53,10.1,6
2902,white,,0.360,0.14,8.9,0.036,38.0,155.0,0.99622,3.27,,9.4,5
6428,red,,0.440,0.09,2.2,0.063,9.0,18.0,0.99444,,0.69,11.3,6
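
The rows with a missing `fixed acidity` end up at the bottom of the sorted output. The sketch below, on toy data, shows the `ascending` and `na_position` parameters of `sort_values` that control the direction and where missing values go:

```python
import pandas as pd

# Toy data with one missing value.
toy_df = pd.DataFrame({"fixed acidity": [7.0, None, 3.8, 15.9]})

# Missing values are placed last by default, also when sorting descending.
descending = toy_df.sort_values(by="fixed acidity", ascending=False)
print(descending.index.tolist())  # [3, 0, 2, 1]

# na_position="first" moves them to the top instead.
nan_first = toy_df.sort_values(by="fixed acidity", na_position="first")
print(nan_first.index.tolist())  # [1, 2, 0, 3]
```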


In [54]:
wine_df["fixed acidity"].min()

3.8

In [55]:
sorted_wine_df.reset_index()

Unnamed: 0,index,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,4259,white,3.8,0.310,0.02,11.1,0.036,20.0,114.0,0.99248,3.75,0.44,12.4,6
1,4787,white,3.9,0.225,0.40,4.2,0.030,29.0,118.0,0.98900,3.57,0.36,12.8,8
2,3265,white,4.2,0.215,0.23,5.1,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3
3,2872,white,4.2,0.170,0.36,1.8,0.029,93.0,161.0,0.98999,3.65,0.89,12.0,7
4,4847,white,4.4,0.540,0.09,5.1,0.038,52.0,97.0,0.99022,3.41,0.40,12.2,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,518,white,,0.130,0.28,1.9,0.050,20.0,78.0,0.99180,3.43,0.64,10.8,6
6493,1079,white,,,0.29,6.2,0.046,29.0,227.0,0.99520,3.29,0.53,10.1,6
6494,2902,white,,0.360,0.14,8.9,0.036,38.0,155.0,0.99622,3.27,,9.4,5
6495,6428,red,,0.440,0.09,2.2,0.063,9.0,18.0,0.99444,,0.69,11.3,6


In [56]:
sorted_wine_df.reset_index(drop=True)

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,3.8,0.310,0.02,11.1,0.036,20.0,114.0,0.99248,3.75,0.44,12.4,6
1,white,3.9,0.225,0.40,4.2,0.030,29.0,118.0,0.98900,3.57,0.36,12.8,8
2,white,4.2,0.215,0.23,5.1,0.041,64.0,157.0,0.99688,3.42,0.44,8.0,3
3,white,4.2,0.170,0.36,1.8,0.029,93.0,161.0,0.98999,3.65,0.89,12.0,7
4,white,4.4,0.540,0.09,5.1,0.038,52.0,97.0,0.99022,3.41,0.40,12.2,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,white,,0.130,0.28,1.9,0.050,20.0,78.0,0.99180,3.43,0.64,10.8,6
6493,white,,,0.29,6.2,0.046,29.0,227.0,0.99520,3.29,0.53,10.1,6
6494,white,,0.360,0.14,8.9,0.036,38.0,155.0,0.99622,3.27,,9.4,5
6495,red,,0.440,0.09,2.2,0.063,9.0,18.0,0.99444,,0.69,11.3,6


### Handling Data Types and Missing Values

In [57]:
sorted_wine_df["free sulfur dioxide"].dtype

dtype('float64')

In [58]:
sorted_wine_df["free sulfur dioxide"].astype("int64")

4259    20
4787    29
3265    64
2872    93
4847    52
        ..
518     20
1079    29
2902    38
6428     9
6429    13
Name: free sulfur dioxide, Length: 6497, dtype: int64
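
The conversion above works because `free sulfur dioxide` has no missing values. For a column that does have gaps, `astype("int64")` fails; pandas' nullable `"Int64"` dtype is one way around that. A sketch on a toy Series:

```python
import pandas as pd

# Toy Series with one missing value.
toy_series = pd.Series([20.0, None, 64.0])

try:
    toy_series.astype("int64")  # NaN has no int64 representation
except ValueError as err:
    print("ValueError:", err)

# The nullable "Int64" dtype (capital I) keeps the gap as <NA>.
nullable = toy_series.astype("Int64")
print(nullable)
```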

In [59]:
wine_df.dropna()

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,white,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6491,red,6.8,0.620,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6
6492,red,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6494,red,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,red,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [60]:
wine_df

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,white,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,red,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6493,red,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,,11.2,6
6494,red,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,red,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [61]:
wine_df[wine_df["fixed acidity"].isnull()]

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
17,white,,0.66,0.48,1.2,0.029,29.0,75.0,0.9892,3.33,0.39,12.8,8
174,white,,0.27,0.31,17.7,0.051,33.0,173.0,0.999,3.09,0.64,10.2,5
249,white,,0.41,0.14,10.4,0.037,18.0,119.0,0.996,3.38,0.45,10.0,5
267,white,,0.58,0.07,6.9,0.043,34.0,149.0,0.9944,3.34,0.57,9.7,5
368,white,,0.29,0.48,2.3,0.049,36.0,178.0,0.9931,3.17,0.64,10.6,6
518,white,,0.13,0.28,1.9,0.05,20.0,78.0,0.9918,3.43,0.64,10.8,6
1079,white,,,0.29,6.2,0.046,29.0,227.0,0.9952,3.29,0.53,10.1,6
2902,white,,0.36,0.14,8.9,0.036,38.0,155.0,0.99622,3.27,,9.4,5
6428,red,,0.44,0.09,2.2,0.063,9.0,18.0,0.99444,,0.69,11.3,6
6429,red,,0.705,0.1,2.8,0.081,13.0,28.0,0.99631,,0.66,10.2,5


In [62]:
mean_acidity = wine_df["fixed acidity"].mean()

In [63]:
mean_acidity

7.2165793124710955

In [64]:
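# Assign the result back; fillna(inplace=True) on a selected column may not update wine_df in newer pandas versions
wine_df["fixed acidity"] = wine_df["fixed acidity"].fillna(mean_acidity)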

In [65]:
wine_df.loc[17]

type                       white
fixed acidity           7.216579
volatile acidity            0.66
citric acid                 0.48
residual sugar               1.2
chlorides                  0.029
free sulfur dioxide         29.0
total sulfur dioxide        75.0
density                   0.9892
pH                          3.33
sulphates                   0.39
alcohol                     12.8
quality                        8
Name: 17, dtype: object
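
To check how many gaps each column has before and after filling, `isnull().sum()` gives a per-column count. A minimal sketch on a toy frame with invented values:

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy_df = pd.DataFrame(
    {
        "fixed acidity": [7.0, None, 6.3],
        "pH": [3.0, 3.3, None],
    }
)

# isnull() marks each missing cell; sum() counts them per column.
print(toy_df.isnull().sum())

# Fill one column with its own mean and assign the result back.
toy_df["fixed acidity"] = toy_df["fixed acidity"].fillna(toy_df["fixed acidity"].mean())
print(toy_df.isnull().sum())
```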

### Combining Dataframes

In [66]:
pd.concat([df, wine_df])

Unnamed: 0,Classes,Schools,Seniority,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
a,AI,CS,Junior,,,,,,,,,,,,,
b,Stats,Maths,Senior,,,,,,,,,,,,,
c,Thermodynamics,Engineering,Sophomore,,,,,,,,,,,,,
0,,,,white,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6.0
1,,,,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,,,,red,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5.0
6493,,,,red,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,,11.2,6.0
6494,,,,red,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6.0
6495,,,,red,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5.0


In [67]:
data_dict_1 = {
    "Classes" : ["AI", "Stats", "Thermodynamics"],
    "Schools": ["CS", "Maths", "Engineering"],
    }

df_1 = pd.DataFrame(data_dict_1)

data_dict_2 = {
    "Seniority": ["Junior", "Senior", "Sophomore"]    
    }

df_2 = pd.DataFrame(data_dict_2)

In [68]:
df_1

Unnamed: 0,Classes,Schools
0,AI,CS
1,Stats,Maths
2,Thermodynamics,Engineering


In [69]:
df_2

Unnamed: 0,Seniority
0,Junior
1,Senior
2,Sophomore


In [70]:
df_1.join(df_2)

Unnamed: 0,Classes,Schools,Seniority
0,AI,CS,Junior
1,Stats,Maths,Senior
2,Thermodynamics,Engineering,Sophomore
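
`join` lines rows up by index. When two frames instead share a key column, `merge` can be used. The sketch below uses two made-up frames (`classes`, `seniority`) whose rows are in different orders:

```python
import pandas as pd

classes = pd.DataFrame(
    {
        "Classes": ["AI", "Stats", "Thermodynamics"],
        "Schools": ["CS", "Maths", "Engineering"],
    }
)
# Rows deliberately listed in a different order, with one class missing.
seniority = pd.DataFrame(
    {"Classes": ["Stats", "AI"], "Seniority": ["Senior", "Junior"]}
)

# how="left" keeps every row of `classes`; unmatched rows get NaN.
merged = classes.merge(seniority, on="Classes", how="left")
print(merged)
```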


## Time to test your knowledge!

### Question 1 - Groupby, Summary functions, Indexing


Suppose you are a shopkeeper who sells wine, and you have a limited amount of money to buy new stock. To decide which quality and alcohol percentage of wine to buy, you want to figure out which quality of wine sold the most and, for that quality, what the average alcohol percentage was.

Steps
1. Figure out which quality of wine sold the most. (*Hint: count(), groupby()*)
2. Filter out the relevant portion of the dataset. (*Hint: Boolean indexing*)
3. Find average value for the "alcohol" column. (*Hint: mean()*)



## What's next?

- Kaggle course: https://www.kaggle.com/learn/pandas
- Pandas docs: https://pandas.pydata.org/docs/


## Next session on Wednesday, 2nd November 
- Basic Machine Learning concepts

## Future sessions	
- Intermediate machine learning