# Day 2 - Pandas for Machine Learning
### Machine Learning Roadmap — Week 1
### Author: N Manish Kumar

## 1. Introduction

In this notebook, I will:
- Load and Inspect Data
- Select and Filter Data
- Handle Missing Values
- Convert Categorical Data to Numerical Data
- Basic Feature Engineering
- Basic EDA (Will go in depth tomorrow)
- Prepare Data for ML

Using the following:
### Dataset: Titanic
### Goal: Clean and prepare data for ML

In [1]:
import pandas as pd

## 2. Loading and Inspecting Data
Load the Titanic dataset and inspect it to understand its structure.

In [2]:
df = pd.read_csv("data/train.csv")
# head() returns the first five rows of the resulting DataFrame
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# We can use the shape attribute to check how large the resulting DataFrame is
df.shape

(891, 12)

In [4]:
# info() gives a concise overview of the DataFrame (dtypes and non-null counts), which helps in spotting missing data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
# describe() gives descriptive statistics for numeric columns by default
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


## 3. Select and Filter Data

In [6]:
# Column Selection (Selecting only required Columns)
print(df['Age'])
df[['Age', 'Fare', 'Survived']]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64


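| | Age | Fare | Survived |
|---|---|---|---|
| 0 | 22.0 | 7.2500 | 0 |
| 1 | 38.0 | 71.2833 | 1 |
| 2 | 26.0 | 7.9250 | 1 |
| 3 | 35.0 | 53.1000 | 1 |
| 4 | 35.0 | 8.0500 | 0 |
| ... | ... | ... | ... |
| 886 | 27.0 | 13.0000 | 0 |
| 887 | 19.0 | 30.0000 | 1 |
| 888 | NaN | 23.4500 | 0 |
| 889 | 26.0 | 30.0000 | 1 |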


In [7]:
# Row Filtering (Selecting rows that meet certain conditions)
df[df['Age'] > 30]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S |
| 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
| 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
| 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
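
Row filtering works by building a boolean mask and passing it to the DataFrame. The sketch below uses a small made-up frame (not the Titanic data) to show the mask itself, how two conditions can be combined with `&`, and that a row with a missing `Age` fails the comparison and is therefore left out.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 54.0, np.nan],
    "Sex": ["male", "female", "male", "female"],
})

mask = toy["Age"] > 30          # boolean Series; NaN > 30 evaluates to False
older = toy[mask]               # keeps only rows where the mask is True

# Each condition needs its own parentheses when combined with & (and) or | (or)
older_women = toy[(toy["Age"] > 30) & (toy["Sex"] == "female")]

print(mask.tolist())               # [False, True, True, False]
print(older_women.index.tolist())  # [1]
```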


#### iloc and loc
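Both iloc and loc are row-first, column-second, so retrieving rows is marginally easier than retrieving columns.
- iloc: integer position-based selection. A slice such as 0:5 excludes the end position (5 rows).
- loc: label-based selection. A slice such as 0:5 includes the end label (6 rows with the default index).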
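
The difference is easiest to see when labels and positions are not the same numbers. This is a minimal sketch on a made-up frame with a string index (the index values `a` to `d` are an assumption for illustration):

```python
import pandas as pd

# Toy frame with a string index so that labels and positions differ
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0], "Fare": [7.25, 71.28, 7.93, 53.10]},
    index=["a", "b", "c", "d"],
)

by_label = toy.loc["a":"c", ["Age", "Fare"]]  # label slice: end label "c" is included
by_position = toy.iloc[0:2, 0]                # position slice: position 2 is excluded

print(by_label.shape)        # (3, 2)
print(by_position.tolist())  # [22.0, 38.0]
```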

In [8]:
df.loc[0:5, ['Age','Fare']]

| | Age | Fare |
|---|---|---|
| 0 | 22.0 | 7.25 |
| 1 | 38.0 | 71.2833 |
| 2 | 26.0 | 7.925 |
| 3 | 35.0 | 53.1 |
| 4 | 35.0 | 8.05 |
| 5 | NaN | 8.4583 |


In [9]:
df.iloc[0:5,3]

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

## 4. Handling Missing Values
Missing entries are represented as NaN (short for "Not a Number").

In [10]:
# isnull() flags NaN entries; chaining sum() counts them per column
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
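
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on made-up data: because `isnull()` returns booleans, `sum()` gives the count and `mean()` gives the fraction missing per column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

print(missing_count.tolist())  # [2, 3, 0]
print(missing_share.tolist())  # [0.5, 0.75, 0.0]
```

A column that is mostly empty (like Cabin here) is a candidate for dropping, while a column with only a few gaps is usually worth filling.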

In [11]:
# fillna() replaces missing values: median for the numeric Age, mode for the categorical Embarked
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop Cabin (687 of 891 values missing) and PassengerId (an identifier, not a feature)
df.drop(columns=['Cabin', 'PassengerId'], inplace=True)
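
To see what the two fill strategies do, here is a minimal sketch on a made-up frame. `median()` skips NaN when computing the fill value, and `mode()` returns a Series (there can be ties), which is why `[0]` is needed to pick a single value.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

age_median = toy["Age"].median()           # NaN is ignored, so this is the median of 20, 30, 40
embarked_mode = toy["Embarked"].mode()[0]  # most frequent value; [0] takes the first if tied

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(toy["Age"].tolist())       # [20.0, 30.0, 30.0, 40.0]
print(toy["Embarked"].tolist())  # ['S', 'C', 'S', 'S']
```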

## 5. Convert Categorical Data to Numerical Data

In [12]:
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# pd.get_dummies converts categorical columns into dummy/indicator variables
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
df['Embarked_Q'] = df['Embarked_Q'].astype(int)
df['Embarked_S'] = df['Embarked_S'].astype(int)

# df.astype is a pandas method used to change the data type (dtype) of one or more columns in a DataFrame.
# df['Survived']=df['Survived'].astype(bool)
# df['Survived'].dtype
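
The two encodings can be checked on a tiny made-up example. With `drop_first=True`, the first category in sorted order (here `C`) gets no column of its own and is represented by zeros in all remaining columns. Passing `dtype=int` to `get_dummies` is an alternative to the separate `astype(int)` calls; this is a sketch, not a change to the cell above.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One-hot encode, dropping the first category (C) to avoid a redundant column
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['Embarked_Q', 'Embarked_S']
print(encoded.iloc[1].tolist())  # [0, 0]  (the "C" row)

# map() handles a two-level column with an explicit dictionary
sex_codes = pd.Series(["male", "female", "male"]).map({"male": 0, "female": 1})
print(sex_codes.tolist())        # [0, 1, 0]
```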

## 6. Basic Feature Engineering
Feature engineering means creating new columns, dropping irrelevant columns, and renaming columns, because better features tend to give better models.

In [13]:
# Creating new columns 
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
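
A minimal sketch of the same idea on made-up rows: the `+ 1` counts the passenger themselves, so someone travelling alone gets a family size of 1. The rename at the end is only an illustration, and the new column names are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
print(toy["FamilySize"].tolist())  # [2, 1, 6]

# Renaming columns (hypothetical names chosen for readability)
renamed = toy.rename(columns={"SibSp": "SiblingsSpouses", "Parch": "ParentsChildren"})
print(renamed.columns.tolist())    # ['SiblingsSpouses', 'ParentsChildren', 'FamilySize']
```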

In [14]:
# Dropping Irrelevant Columns
df.drop(columns=['Name','Ticket'],inplace=True)

In [15]:
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked_Q', 'Embarked_S', 'FamilySize'],
      dtype='object')

## 7. Basic EDA 
EDA = Exploratory Data Analysis.
It’s the process of exploring, summarizing, and understanding your dataset before doing any modeling or machine learning.

Think of EDA as getting to know your data:
- What columns exist
- What types of values they contain
- Whether there are missing values
- What patterns or relationships exist
- What distributions look like
- What anomalies or outliers might cause trouble

Pandas is the main tool for all of this.

This section covers value counts, groupby(), and correlation.
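
The checklist above maps roughly onto a handful of pandas calls. This is a sketch on made-up data; the 1.5 × IQR rule used for outliers is a common convention that I am assuming here, not something this notebook relies on elsewhere.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, 80.0],
    "Fare": [7.25, 71.28, 7.93, 8.05, 512.33],
})

columns = toy.columns.tolist()         # what columns exist
dtypes = toy.dtypes                    # what types of values they contain
missing = toy.isnull().sum()           # whether there are missing values
relationships = toy.corr()             # pairwise correlations between numeric columns
distribution = toy["Fare"].describe()  # count, mean, std, min, quartiles, max

# Outliers: values above Q3 + 1.5 * IQR (assumed rule of thumb)
q1, q3 = toy["Fare"].quantile([0.25, 0.75])
outliers = toy[toy["Fare"] > q3 + 1.5 * (q3 - q1)]

print(missing["Age"])             # 1
print(outliers["Fare"].tolist())  # [512.33]
```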

In [16]:
df['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

In [17]:
# groupby() splits the DataFrame into groups by column value, applies an operation (sum, mean, count, ...),
# and combines the results. Survived is 0/1, so sum() counts survivors per Sex (0 = male, 1 = female).
df.groupby('Sex')['Survived'].sum()

Sex
0    109
1    233
Name: Survived, dtype: int64
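
A sum alone can mislead when the groups have different sizes, so it helps to look at the count and the mean (the survival rate) side by side. A minimal sketch on made-up data, using `agg` to apply several operations at once:

```python
import pandas as pd

# Toy data, invented for illustration only (0 = male, 1 = female)
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 0, 1, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# sum = survivors, count = group size, mean = survival rate
summary = toy.groupby("Sex")["Survived"].agg(["sum", "count", "mean"])

print(summary.loc[0].tolist())  # [1.0, 4.0, 0.25]
print(summary.loc[1].tolist())  # [2.0, 2.0, 1.0]
```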

In [18]:
# Series.corr() computes the correlation (Pearson by default) between two numeric columns
df['Age'].corr(df['Fare'])

np.float64(0.09668842218036486)
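
Calling `corr()` on a whole DataFrame gives the full correlation matrix instead of a single pair. A minimal sketch on made-up numbers, chosen so that Age and Fare are perfectly linearly related:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Pclass": [3, 3, 1, 1],
})

pair = toy["Age"].corr(toy["Fare"])  # one pair of columns
matrix = toy.corr()                  # every pair of numeric columns

print(round(pair, 6))                  # 1.0
print(matrix.shape)                    # (3, 3)
print(matrix.loc["Age", "Pclass"] < 0) # True: higher age goes with lower class number here
```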

## 9. Preparing Data for ML
Feature/target split and conversion to NumPy arrays.

In [19]:
# This is the column we want to predict
y = df['Survived']
# X contains all the feature columns; y contains only the labels
X = df.drop(columns=['Survived'])
print(X.shape)
print(y.shape)

(891, 9)
(891,)


With no missing values and only numeric columns, this data is ready to be passed to most ML models.

In [20]:
# Optional: convert to NumPy arrays, a common input format for ML libraries
X = X.to_numpy()
y = y.to_numpy()
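
One thing to keep in mind is that a NumPy array has no column names. The sketch below (made-up data, and separate variable names such as `X_arr` that are my own choice) keeps the DataFrame and the feature names around instead of overwriting `X` and `y`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.93],
})

y = toy["Survived"]                  # target column
X = toy.drop(columns=["Survived"])   # everything else is a feature

feature_names = X.columns.tolist()   # save the names before converting
X_arr = X.to_numpy()                 # 2-D array: (rows, features)
y_arr = y.to_numpy()                 # 1-D array: (rows,)

print(feature_names)  # ['Pclass', 'Fare']
print(X_arr.shape)    # (3, 2)
print(y_arr.shape)    # (3,)
```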

## 10. Converting the Cleaned Data into a CSV file

In [21]:
df.to_csv("Data/titanic_cleaned.csv",index=False)
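
The `index=False` argument matters: without it, the row index is written as an extra unnamed column and comes back as `Unnamed: 0` when the file is read again. A minimal sketch on made-up data, writing to in-memory strings instead of a file so it can run anywhere:

```python
import io
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Survived": [0, 1], "Fare": [7.25, 71.28]})

with_index = toy.to_csv()                # no path given, so the CSV text is returned
without_index = toy.to_csv(index=False)

leaked = pd.read_csv(io.StringIO(with_index))
roundtrip = pd.read_csv(io.StringIO(without_index))

print(leaked.columns.tolist())     # ['Unnamed: 0', 'Survived', 'Fare']
print(roundtrip.columns.tolist())  # ['Survived', 'Fare']
```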

## 11. Checking Cleaned File

In [22]:
df_cleaned = pd.read_csv("Data/titanic_cleaned.csv")
df_cleaned.head()

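| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_Q | Embarked_S | FamilySize |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 2 |
| 1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 2 |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 1 |
| 3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 1 | 2 |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 1 |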


In [23]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int64  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked_Q  891 non-null    int64  
 8   Embarked_S  891 non-null    int64  
 9   FamilySize  891 non-null    int64  
dtypes: float64(2), int64(8)
memory usage: 69.7 KB


In [24]:
df_cleaned.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Embarked_Q    0
Embarked_S    0
FamilySize    0
dtype: int64

## 12. Summary
Completed Day 2: Cleaned Titanic dataset using Pandas.
Handled missing values, encoded categorical variables, and engineered FamilySize.
Produced a fully clean dataset for ML modeling.