<a href="https://colab.research.google.com/github/DeepsMaxi305/Data_Science/blob/main/data_preprocessing_and_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing and Feature Engineering
You should build a machine learning pipeline with a data preprocessing and feature engineering step. In particular, you should do the following:
- Load the `adult` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Conduct data preprocessing and feature engineering by 
    - removing missing values using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html);
    - encoding categorical attributes using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html);
    - normalizing/scaling features using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html);
    - handling imbalanced classes using [Imbalanced-Learn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html);
    - and reducing the dimensionality of the dataset using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Evaluate the impact of the data preprocessing and feature engineering techniques on the effectiveness and efficiency of the model.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

# Importing the Libraries

In [27]:
import pandas as pd
import sklearn.model_selection
import sklearn.svm
import sklearn.metrics
import sklearn.preprocessing
import sklearn.decomposition
import imblearn.over_sampling



# Loading the Dataset

In [28]:
df = pd.read_csv("adult.csv")
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


# Splitting the Dataset into Training and Test Sets

In [29]:
df_train, df_test = sklearn.model_selection.train_test_split(df)
print("df_train size:", df_train.shape)
print("df_test size:", df_test.shape)



df_train size: (24420, 15)
df_test size: (8141, 15)
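The split above relies on the defaults (a shuffled 75/25 split with no fixed seed), so the rows in each part change from run to run. A minimal sketch of a reproducible, stratified variant follows. The toy DataFrame, the seed `42`, and the explicit `test_size=0.25` are illustrative assumptions, not values taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the adult data (hypothetical values): 80% "<=50K", 20% ">50K".
toy = pd.DataFrame({
    "age": list(range(20, 60)),
    "target": ["<=50K"] * 32 + [">50K"] * 8,
})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["target"]
)

print("toy_train size:", toy_train.shape)
print("toy_test size:", toy_test.shape)
print(toy_test["target"].value_counts())
```

Stratifying matters here because the target is imbalanced; without it, the share of `>50K` rows in the test set can drift away from the share in the training set.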


# Data Exploration

In [30]:
df_train["workclass"].value_counts()

 Private             17062
 Self-emp-not-inc     1901
 Local-gov            1555
 ?                    1343
 State-gov             993
 Self-emp-inc          832
 Federal-gov           718
 Without-pay            10
 Never-worked            6
Name: workclass, dtype: int64
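The ` ?` entry in the counts above is the dataset's placeholder for a missing value, and every category carries a leading space. The sketch below shows one way to count that placeholder in every column at once; the toy frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical excerpt that mimics the leading-space formatting of the data.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " ?"],
    "occupation": [" Sales", " ?", " Tech-support", " Sales"],
    "age": [39, 50, 38, 53],
})

# Element-wise comparison against the placeholder, then count hits per column.
placeholder_counts = toy.eq(" ?").sum()
print(placeholder_counts)
```

Columns with a non-zero count are the ones that will lose rows once the placeholder is converted to a real missing value and dropped.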

In [31]:
df_train["target"].value_counts()


 <=50K    18548
 >50K      5872
Name: target, dtype: int64
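The two counts above show that the classes are imbalanced. As a small worked example, the relative frequencies and the majority-to-minority ratio can be computed directly from those counts; the Series below simply copies the numbers printed above.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({" <=50K": 18548, " >50K": 5872}, name="target")

shares = counts / counts.sum()                 # relative class frequencies
imbalance_ratio = counts.max() / counts.min()  # majority / minority

print(shares.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On the real data the same shares can be obtained with `df_train["target"].value_counts(normalize=True)`.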

# Data Preprocessing and Feature Engineering

Removing Missing Values

In [32]:
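# The dataset marks missing values with " ?" (note the leading space).
# Convert the placeholder to a proper missing value, then drop incomplete rows.
# pd.NA is the generic missing marker; pd.NaT is meant for datetime data.
df_train = df_train.replace(" ?", pd.NA)
df_train = df_train.dropna()
df_test = df_test.replace(" ?", pd.NA)
df_test = df_test.dropna()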
print("df_train size:", df_train.shape)
print("df_test size:", df_test.shape)

df_train size: (22659, 15)
df_test size: (7503, 15)
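An alternative is to handle the placeholder while loading, so that no separate `replace` step is needed. The sketch below assumes the file separates fields with a comma followed by a space, which is what the leading spaces in the values suggest; the three-row CSV text is a hypothetical excerpt, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the assumed layout of adult.csv.
csv_text = """age,workclass,occupation,target
39, State-gov, Adm-clerical, <=50K
54, ?, ?, >50K
28, Private, Prof-specialty, <=50K
"""

# skipinitialspace drops the blank after each comma; na_values maps "?" to NaN.
toy = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, na_values="?")

print(toy.isna().sum())   # missing values per column
toy_clean = toy.dropna()  # keep only complete rows
print("toy_clean size:", toy_clean.shape)
```

A side effect of `skipinitialspace=True` is that category labels lose their leading space, so later comparisons would use `"Private"` rather than `" Private"`.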


# Separating the Features and Target Label

In [33]:
x_train = df_train.drop(["target"], axis=1)
y_train = df_train["target"]

x_test = df_test.drop(["target"], axis=1)
y_test = df_test["target"]

print("x_train size:", x_train.shape)
print("x_test size:", x_test.shape)
print("y_train size:", y_train.shape)
print("y_test size:", y_test.shape)

x_train size: (22659, 14)
x_test size: (7503, 14)
y_train size: (22659,)
y_test size: (7503,)


# Encoding Categorical Attributes

In [34]:
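# One-hot encode the features. No column subset is selected, so numeric columns
# such as fnlwgt are treated as categorical too (one indicator per distinct value).
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
enc = sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")
enc.fit(x_train)  # learn the categories from the training set only
x_train = enc.transform(x_train)
x_test = enc.transform(x_test)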

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (22659, 16947)
x_test: (7503, 16947)
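The encoder above is fitted on all 14 feature columns, so numeric attributes such as `fnlwgt` receive one indicator column per distinct value, which is largely why the result has 16,947 columns. A common alternative is to encode only the categorical columns and scale the numeric ones. The sketch below illustrates this with `ColumnTransformer` on a hypothetical three-column miniature; the column lists and values are assumptions for illustration and would need to be adapted to the real frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the feature frame.
toy_train = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 45],
    "workclass": [" State-gov", " Private", " Private", " Self-emp-inc"],
})
toy_test = pd.DataFrame({
    "age": [28],
    "hours-per-week": [40],
    "workclass": [" Federal-gov"],  # category not seen during fit
})

numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["workclass"]

# Each transformer is applied only to the columns listed for it.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

x_toy_train = preprocess.fit_transform(toy_train)  # fit on training data only
x_toy_test = preprocess.transform(toy_test)

print("x_toy_train:", x_toy_train.shape)  # 2 numeric + 3 one-hot columns
print("x_toy_test:", x_toy_test.shape)
```

With this layout the number of output columns grows with the number of categories rather than with the number of distinct numeric values, which also makes later steps such as PCA and SVM training cheaper.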


# Normalizing/Scaling Features

In [35]:
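# x_train is a sparse matrix after one-hot encoding. Centering would destroy the
# sparsity, so with_mean=False only divides each column by its standard deviation.
scaler = sklearn.preprocessing.StandardScaler(with_mean=False)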
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (22659, 16947)
x_test: (7503, 16947)
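The `with_mean=False` setting is needed because the one-hot encoded features are stored as a sparse matrix. The sketch below, on a small hand-made sparse matrix, shows that `StandardScaler` rejects centering for sparse input and that scaling without centering keeps the result sparse. The matrix values are made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Small sparse matrix standing in for the one-hot encoded features.
x_sparse = sparse.csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 2.0],
    [1.0, 0.0],
    [0.0, 2.0],
]))

# Centering would turn the zeros into non-zeros, so scikit-learn refuses it.
try:
    StandardScaler().fit(x_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

# with_mean=False divides each column by its standard deviation only.
x_scaled = StandardScaler(with_mean=False).fit_transform(x_sparse)

print("centering rejected:", centering_rejected)
print("still sparse:", sparse.issparse(x_scaled))
print(x_scaled.toarray())
```

Zero entries stay zero under this scaling, so memory use does not grow with the number of columns.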
