# Pandas Day 3 

# Binning


## What is Binning?
### `Binning` is a data preprocessing technique used to group continuous data into discrete intervals or categories. It is often used in data analysis and machine learning to reduce the effects of minor observation errors and make the data easier to interpret. Binning can also help in feature engineering by creating categorical features from numerical ones. 
### `For Example`, imagine you have a dataset of people's ages, and you want to group them into categories like "Infants," "Toddlers," "Kids," etc., instead of working with individual ages. This grouping is called binning.
### Ages 0 to 1 are grouped as "Infants".
### Ages 2 to 5 are grouped as "Toddlers".
### Ages 6 to 12 are grouped as "Kids".
### Ages 13 to 18 are grouped as "Teens".
### Ages 19 to 30 are grouped as "Youngs".
### Ages 31 to 50 are grouped as "Middle Aged".
### Ages 51 to 80 are grouped as "Old". 
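
The grouping listed above can be sketched on a handful of made-up ages before applying it to a real dataset. The sample values and the names `ages` and `age_groups` are assumptions chosen for illustration; the bin edges and labels follow the categories in the list.

```python
import pandas as pd

# Hypothetical sample ages, one for each category
ages = pd.Series([0.5, 3, 9, 15, 25, 40, 65])

# Bin edges and labels matching the age categories listed above
bins = [0, 1, 5, 12, 18, 30, 50, 80]
labels = ["Infants", "Toddlers", "Kids", "Teens", "Youngs", "Middle Aged", "Old"]

# pd.cut places each age in the interval (left edge, right edge] it falls into
age_groups = pd.cut(ages, bins=bins, labels=labels)
print(age_groups.tolist())
```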
-----

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load dataset
df = sns.load_dataset("titanic")  # seaborn dataset names are lowercase
# Display first few rows
print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [3]:
# Display dataset information
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
None


In [8]:
# Impute the missing values in the 'age' column with the mean age
df['age'] = df['age'].fillna(df['age'].mean())
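
As a minimal sketch of what mean imputation does, the same `fillna` call can be tried on a tiny made-up Series. The values and the names `ages` and `filled` are hypothetical; they are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([20.0, np.nan, 40.0, np.nan])

# mean() skips NaN by default, so the mean here is (20 + 40) / 2 = 30
filled = ages.fillna(ages.mean())
print(filled.tolist())
```

Every missing entry receives the same value, so the column mean is unchanged while the spread of the column shrinks.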

In [9]:
# Check the missing values again
print(df.isnull().sum())

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


In [10]:
# Bin the 'age' column into 7 categories
bins = [0, 1, 5, 12, 18, 30, 50, 80]
labels = ["Infants", "Toddlers", "Kids", "Teens", "Youngs", "Middle Aged", "Old"]
# Column to convert into bins
pd.cut(df['age'], bins=bins, labels=labels)

0           Youngs
1      Middle Aged
2           Youngs
3      Middle Aged
4      Middle Aged
          ...     
886         Youngs
887         Youngs
888         Youngs
889         Youngs
890    Middle Aged
Name: age, Length: 891, dtype: category
Categories (7, object): ['Infants' < 'Toddlers' < 'Kids' < 'Teens' < 'Youngs' < 'Middle Aged' < 'Old']
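
One detail worth checking is how `pd.cut` treats values that sit exactly on a bin edge. By default each interval is open on the left and closed on the right, so an age of exactly 0 falls outside the first bin and becomes NaN unless `include_lowest=True` is passed. The sketch below uses made-up edge values (`edge_ages` is a hypothetical name) to show the difference.

```python
import pandas as pd

bins = [0, 1, 5, 12, 18, 30, 50, 80]
labels = ["Infants", "Toddlers", "Kids", "Teens", "Youngs", "Middle Aged", "Old"]

# Hypothetical ages that sit exactly on bin edges
edge_ages = pd.Series([0, 1, 5, 80])

# Default: intervals are (0, 1], (1, 5], ... so 0 is not in any bin
default_cut = pd.cut(edge_ages, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(edge_ages, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

In the Titanic data this does not change the result, since the counts shown later add up to all 891 rows, but it can matter for datasets that contain an age of 0.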

# Feature Engineering


### `Feature Engineering` is the process of creating, modifying, or selecting features (variables) in a dataset to improve the performance of machine learning models. It involves transforming raw data into meaningful features that better represent the underlying problem to the predictive models.
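
Besides binning, a common form of feature engineering is combining existing columns into a new one. The sketch below is an illustration only: the `family_size` feature and the sample rows are assumptions, with column names borrowed from the Titanic dataset.

```python
import pandas as pd

# Hypothetical passengers; column names mirror the Titanic dataset
passengers = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})

# Derived feature (hypothetical): siblings/spouses + parents/children
# + the passenger themselves
passengers["family_size"] = passengers["sibsp"] + passengers["parch"] + 1
print(passengers)
```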

In [18]:
# Add a new column to the dataset based on another column in the dataset.
bins = [0, 1, 5, 12, 18, 30, 50, 80]
labels = ["Infants", "Toddlers", "Kids", "Teens", "Youngs", "Middle Aged", "Old"]
# Column to convert into bins
df["binned_age"] = pd.cut(df['age'], bins=bins, labels=labels)


In [21]:
df.sample(5)

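|     | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | binned_age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 480 | 0 | 3 | male | 9.0 | 5 | 2 | 46.9 | S | Third | child | False | NaN | Southampton | no | False | Kids |
| 823 | 1 | 3 | female | 27.0 | 0 | 1 | 12.475 | S | Third | woman | False | E | Southampton | yes | False | Youngs |
| 599 | 1 | 1 | male | 49.0 | 1 | 0 | 56.9292 | C | First | man | True | A | Cherbourg | yes | False | Middle Aged |
| 529 | 0 | 2 | male | 23.0 | 2 | 1 | 11.5 | S | Second | man | True | NaN | Southampton | no | False | Youngs |
| 888 | 0 | 3 | female | 29.699118 | 1 | 2 | 23.45 | S | Third | woman | False | NaN | Southampton | no | False | Youngs |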


In [63]:
df["binned_age"].value_counts()

binned_age
Youngs         447
Middle Aged    241
Teens           70
Old             64
Toddlers        30
Kids            25
Infants         14
Name: count, dtype: int64
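
`value_counts()` orders its result by frequency. When the natural order of the bins is more useful, sorting by the index is one option, because the categories produced by `pd.cut` are ordered. A minimal sketch on made-up ages (the names `ages`, `groups`, and `ordered_counts` are hypothetical):

```python
import pandas as pd

bins = [0, 1, 5, 12, 18, 30, 50, 80]
labels = ["Infants", "Toddlers", "Kids", "Teens", "Youngs", "Middle Aged", "Old"]

# Hypothetical ages covering every category at least once
ages = pd.Series([0.5, 3, 3, 9, 15, 25, 25, 25, 40, 65])
groups = pd.cut(ages, bins=bins, labels=labels)

# sort_index() follows the category order rather than the counts
ordered_counts = groups.value_counts().sort_index()
print(ordered_counts)
```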

# Renaming a Column

In [22]:
df.rename(columns={"binned_age":"Age_Groups"}, inplace=True)
df.head()

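|     | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | Age_Groups |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False | Youngs |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False | Middle Aged |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True | Youngs |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False | Middle Aged |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True | Middle Aged |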
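
`rename` can also be used without `inplace=True`, in which case it returns a new DataFrame and leaves the original untouched. A small sketch on a made-up frame (`frame` and `renamed` are hypothetical names):

```python
import pandas as pd

# Hypothetical frame with the column name used earlier
frame = pd.DataFrame({"binned_age": ["Kids", "Teens"]})

# Without inplace=True, rename returns a renamed copy
renamed = frame.rename(columns={"binned_age": "Age_Groups"})

print(list(frame.columns))    # original column name is unchanged
print(list(renamed.columns))  # new column name
```

Assigning the result back, as in `df = df.rename(...)`, has the same effect as the in-place call.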


# 🎯 Mission Successful! 🚀✨