# Data Preprocessing and Feature Engineering in Machine Learning

### Objective:
This assignment builds practical skills in data preprocessing, feature engineering, and feature selection, all of which are crucial for building efficient machine learning models. You will work with a provided dataset and apply scaling, encoding, outlier detection with Isolation Forest, and Predictive Power Score (PPS) analysis.

### Dataset:
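The assignment specifies the "Adult" dataset, which predicts whether income exceeds $50K/yr based on census data. The code in this notebook loads `anime.csv` instead, so all outputs shown here describe the anime data.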


### 1. Data Exploration and Preprocessing
##### a. Load the dataset and conduct basic data exploration:

In [41]:
!pip install ppscore
import pandas as pd

# Load the dataset
df = pd.read_csv('anime.csv')

# Basic data exploration
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None


In [17]:
print(df.describe())


           anime_id        rating       members
count  12294.000000  12064.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.026746  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.880000  2.250000e+02
50%    10260.500000      6.570000  1.550000e+03
75%    24794.500000      7.180000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06


In [18]:
print(df.isnull().sum())
df.head()


anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |


##### b. Handle missing values:

For missing values, depending on the feature, we can either impute or remove them.

In [19]:
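# Fill missing values in numerical columns with the column mean
# (numeric_only=True skips object columns such as 'name' and 'genre')
df.fillna(df.mean(numeric_only=True), inplace=True)

# Fill missing 'type' values with the most frequent category;
# mode() returns a Series, so take its first element
df['type'] = df['type'].fillna(df['type'].mode()[0])

# Drop rows where 'genre' is missing
df.dropna(subset=['genre'], inplace=True)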



In [31]:
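# Fill any remaining gaps in categorical (object) columns with the column mode
categorical_columns = df.select_dtypes(include=['object']).columns

for column in categorical_columns:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    df[column] = df[column].fillna(df[column].mode()[0])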


In [32]:
print(df.isnull().sum())

anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64
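
The `df.info()` output lists `episodes` as an `object` column, so the numeric mean-fill above does not touch it. The following is a minimal sketch of one way to handle such a column. It uses a hypothetical toy frame and assumes the column contains a non-numeric placeholder such as `"Unknown"`; the source does not show which values actually occur.

```python
import pandas as pd

# Hypothetical toy frame; the "Unknown" placeholder is an assumption.
toy = pd.DataFrame({"episodes": ["1", "64", "Unknown", "24", "51"]})

# Coerce non-numeric strings to NaN so they show up as missing values.
toy["episodes"] = pd.to_numeric(toy["episodes"], errors="coerce")
n_missing = int(toy["episodes"].isna().sum())

# Impute with the median, which is less sensitive to extreme values than the mean.
toy["episodes"] = toy["episodes"].fillna(toy["episodes"].median())

print(n_missing)
print(toy["episodes"].tolist())
```

Whether the median, the mean, or dropping the rows is appropriate depends on how the feature is used downstream.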


### 2. Encoding Techniques
Apply One-Hot Encoding and Label Encoding:

In [33]:
from sklearn.preprocessing import LabelEncoder

# One-Hot Encoding for the low-cardinality categorical variable 'type'
# (drop_first=True drops one indicator column to avoid redundancy)
df = pd.get_dummies(df, columns=['type'], drop_first=True)

# Label Encoding for the high-cardinality categorical variable 'genre'
le = LabelEncoder()
df['genre'] = le.fit_transform(df['genre'])


##### Discuss the pros and cons of One-Hot Encoding and Label Encoding

a. One-Hot Encoding pros: Imposes no artificial order on the categories and works with algorithms that require numeric input. Cons: Produces a high-dimensional, sparse matrix when a feature has many categories.

b. Label Encoding pros: Simple and memory-efficient, since it yields a single integer column. Cons: Implies an order among categories that may not exist, which can mislead models that treat the codes as magnitudes (for example, linear models and distance-based methods).
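
The trade-off can be seen on a small example. The sketch below uses a hypothetical toy column (the category values are illustrative only) and `pd.factorize(sort=True)` as a stand-in for integer label encoding; it assigns codes in sorted order, much as `LabelEncoder` does.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
toy = pd.DataFrame({"type": ["TV", "Movie", "OVA", "TV", "Movie"]})

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy["type"], prefix="type")

# Integer (label) encoding: a single column, but the codes imply an order.
codes, categories = pd.factorize(toy["type"], sort=True)

print(one_hot.shape)       # (5, 3): width grows with the number of categories
print(codes.tolist())      # [2, 0, 1, 2, 0]
print(list(categories))    # ['Movie', 'OVA', 'TV']
```

Here the integer codes suggest that `TV` (2) is "greater than" `Movie` (0), which has no meaning for this feature.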

### 3. Feature Engineering
Create new features and apply transformations:

In [35]:
import numpy as np

# New feature: rating divided by the number of members
df['rating_per_member'] = df['rating'] / df['members']

# New feature: log(1 + members) to reduce right skew
df['log_members'] = np.log1p(df['members'])


##### Rationale:

'rating_per_member' scales the rating by community size, which can help separate highly rated niche titles from titles that are mainly popular.

'members' is strongly right-skewed (a median of 1,550 versus a maximum of about 1.01 million in the summary above), so a log transformation compresses the long tail and can help algorithms that are sensitive to feature scale and distribution.
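
The effect of the log transformation on skewness can be checked directly. This is a minimal sketch on synthetic, heavy-tailed data standing in for a count such as `members`; the distribution and its parameters are assumptions, not properties of the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic heavy-tailed counts (assumed log-normal for illustration).
members = pd.Series(rng.lognormal(mean=7.0, sigma=2.0, size=5000))

# Sample skewness before and after the log1p transform.
skew_raw = members.skew()
skew_log = np.log1p(members).skew()

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness close to zero after the transform indicates a roughly symmetric distribution.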

### 4. Feature Selection
Use Isolation Forest to identify and remove outliers:

In [36]:
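from sklearn.ensemble import IsolationForest

# Fit an Isolation Forest on 'rating' and 'members';
# contamination=0.1 assumes roughly 10% of rows are outliers,
# and random_state makes the result reproducible
iso = IsolationForest(contamination=0.1, random_state=42)
outliers = iso.fit_predict(df[['rating', 'members']])

# fit_predict returns 1 for inliers and -1 for outliers; keep the inliers
df = df[outliers == 1].copy()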


##### Discussion on outliers:
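Outliers can dominate the loss during training and distort the learned parameters, which hurts generalization. Removing them lets the model fit the bulk of the data, although genuine extreme cases (such as very popular titles) are discarded as well, so the contamination rate should be chosen with care.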
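
To see how the labels behave, the following sketch runs an Isolation Forest on synthetic data: a tight cluster plus a handful of hand-placed extreme points. The data, the contamination rate, and the seed are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 200 points near the origin plus 5 hand-placed extremes.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extremes = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0],
                     [-10.0, -10.0], [0.0, 15.0]])
X = np.vstack([inliers, extremes])

# contamination sets the expected share of outliers (here 5%).
iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier

n_flagged = int((labels == -1).sum())
flagged_extremes = int((labels[-5:] == -1).sum())
print(f"flagged: {n_flagged}, of which hand-placed extremes: {flagged_extremes}")
```

Because `contamination` fixes the share of flagged rows in advance, some ordinary points near the edge of the cluster are flagged along with the extremes.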

##### Apply Predictive Power Score (PPS) and compare with correlation matrix:

In [42]:

import ppscore as pps

# Compute the PPS matrix (one row per feature/target pair)
pps_matrix = pps.matrix(df)

# Compute the Pearson correlation matrix for numeric columns only
corr_matrix = df.corr(numeric_only=True)

print(pps_matrix)
print(corr_matrix)




           x         y   ppscore            case  is_valid_score  \
0   anime_id  anime_id  1.000000  predict_itself            True   
1   anime_id      name  0.000000    target_is_id            True   
2   anime_id     genre  0.085535  classification            True   
3   anime_id      type  0.182616  classification            True   
4   anime_id  episodes  0.012390  classification            True   
5   anime_id    rating  0.000000      regression            True   
6   anime_id   members  0.000000      regression            True   
7       name  anime_id  0.000000   feature_is_id            True   
8       name      name  1.000000  predict_itself            True   
9       name     genre  0.000000   feature_is_id            True   
10      name      type  0.000000  classification            True   
11      name  episodes  0.000000   feature_is_id            True   
12      name    rating  0.000000   feature_is_id            True   
13      name   members  0.000000   feature_is_id


##### Discussion:

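PPS measures how well one feature predicts another on a scale from 0 (no predictive power) to 1 (perfect prediction). It is asymmetric, handles categorical columns, and captures non-linear relationships, whereas the Pearson correlation matrix is symmetric and captures only linear relationships between numeric columns. Comparing the two shows which relationships a linear measure misses.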
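
The difference shows up clearly on a symmetric, non-linear relationship. The sketch below uses synthetic data and a hand-rolled "PPS-style" score: a cross-validated decision tree compared against a naive median predictor. It mirrors the idea behind PPS and is an approximation, not the `ppscore` implementation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: y depends on x quadratically, with a little noise.
x = rng.uniform(-3.0, 3.0, size=1000)
y = x ** 2 + rng.normal(scale=0.1, size=1000)

# Pearson correlation is near zero: the relation has no linear trend.
pearson = np.corrcoef(x, y)[0, 1]

# PPS-style score: 1 - MAE(model) / MAE(naive median predictor), floored at 0.
pred = cross_val_predict(
    DecisionTreeRegressor(random_state=0), x.reshape(-1, 1), y, cv=4
)
mae_model = mean_absolute_error(y, pred)
mae_naive = mean_absolute_error(y, np.full_like(y, np.median(y)))
pps_like = max(0.0, 1.0 - mae_model / mae_naive)

print(f"Pearson r: {pearson:.2f}, PPS-style score: {pps_like:.2f}")
```

A low correlation combined with a high predictive score is the pattern that signals a non-linear dependency.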