<a href="https://colab.research.google.com/github/AhmadNass7/teaching/blob/main/Data_Processing_and_Feature_Engineering_of_the_Dataset_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing and Feature Engineering
You should build a machine learning pipeline with a data preprocessing and feature engineering step. In particular, you should do the following:
- Load the `adult` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Conduct data preprocessing and feature engineering by 
    - removing missing values using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html);
    - encoding categorical attributes using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html);
    - normalizing/scaling features using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html);
    - handling imbalanced classes using [Imbalanced-Learn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html);
    - and reducing the dimensionality of the dataset using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Evaluate the impact of the data preprocessing and feature engineering techniques on the effectiveness and efficiency of the model.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
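
The steps listed above can be chained into a single scikit-learn `Pipeline`. The following is a minimal sketch, not the notebook's own implementation: the data is synthetic, and the column names `group`, `x1`, `x2`, and `label` are hypothetical placeholders. SMOTE is left out here because it requires the separate Imbalanced-Learn pipeline class.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical columns, not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),  # categorical feature
    "x1": rng.normal(size=n),                      # numeric feature
    "x2": rng.normal(size=n),                      # numeric feature
})
toy["label"] = np.where(toy["x1"] + toy["x2"] > 0, "YES", "NO")

X = toy.drop(columns="label")
y = toy["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One-hot encode the categorical column and pass the numeric ones through.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
        ("num", "passthrough", ["x1", "x2"]),
    ],
    sparse_threshold=0,  # always return a dense array so PCA can consume it
)

pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=4)),  # 5 encoded features reduced to 4
    ("svm", SVC(kernel="rbf", C=1.0)),
])

# Every step is fitted on the training split only, then applied to the test split.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Wrapping the steps in one object ensures that the encoder, scaler, and PCA are all fitted on the training data only.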

## Importing Libraries

In [178]:
import pandas as pd 
import sklearn.model_selection
import sklearn.preprocessing
import imblearn.over_sampling
import sklearn.svm
import sklearn.tree
import sklearn.linear_model
import sklearn.metrics

## Loading the Data Sets

In [120]:
# This dataset describes lung cancer patients. Its columns are:
# Gender: M (male), F (female)
# Age: age of the patient
# Smoking: YES=2, NO=1
# Yellow fingers: YES=2, NO=1
# Anxiety: YES=2, NO=1
# Peer_pressure: YES=2, NO=1
# Chronic Disease: YES=2, NO=1
# Fatigue: YES=2, NO=1
# Allergy: YES=2, NO=1
# Wheezing: YES=2, NO=1
# Alcohol: YES=2, NO=1
# Coughing: YES=2, NO=1
# Shortness of Breath: YES=2, NO=1
# Swallowing Difficulty: YES=2, NO=1
# Chest pain: YES=2, NO=1
# Lung Cancer: YES, NO

df=pd.read_csv("slc.csv")
df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


## Splitting The Data Sets into Training and Testing

In [121]:
df_train, df_test= sklearn.model_selection.train_test_split(df)


print("df_train size:", df_train.shape)
print("df_test size:", df_test.shape)

df_train size: (231, 16)
df_test size: (78, 16)
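
The split above relies on the defaults of `train_test_split`: 25% of the rows go to the test set, and the shuffle differs on every run. With a rare class, a stratified and seeded split keeps the class ratio stable across both parts. The sketch below uses a small hypothetical frame, and the `stratify` and `random_state` settings are suggestions rather than what the notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 9:1 class imbalance (hypothetical data for illustration).
toy = pd.DataFrame({
    "AGE": [40 + i % 40 for i in range(100)],
    "LUNG_CANCER": ["YES"] * 90 + ["NO"] * 10,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,               # the scikit-learn default, written out
    stratify=toy["LUNG_CANCER"],  # keep the YES/NO ratio in both splits
    random_state=0,               # reproducible split
)

print(toy_train["LUNG_CANCER"].value_counts())
print(toy_test["LUNG_CANCER"].value_counts())
```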


## Data Exploration


In [122]:
## Training
df_train["AGE"].unique()


array([58, 56, 75, 72, 61, 49, 69, 60, 67, 55, 51, 63, 64, 59, 77, 62, 71,
       76, 70, 48, 38, 57, 47, 52, 68, 54, 44, 73, 79, 65, 81, 74, 39, 66,
       53, 21, 78])

In [123]:
df_train["SMOKING"].unique()

array([2, 1])

In [124]:
df_train["YELLOW_FINGERS"].unique()

array([2, 1])

In [125]:
df_train["ANXIETY"].unique()

array([2, 1])

In [126]:
df_train["PEER_PRESSURE"].unique()

array([2, 1])

In [127]:
df_train["CHRONIC DISEASE"].unique()

array([1, 2])

In [128]:
df_train = df_train.rename(columns={'FATIGUE ' : 'FATIGUE', 'ALLERGY ' : 'ALLERGY'})

In [129]:
df_train["FATIGUE"].unique()

array([2, 1])

In [130]:
df_train["ALLERGY"].unique()

array([1, 2])

In [131]:
df_train["WHEEZING"].unique()

array([1, 2])

In [132]:
df_train["ALCOHOL CONSUMING"].unique()

array([1, 2])

In [133]:
df_train["COUGHING"].unique()

array([2, 1])

In [134]:
df_train["SHORTNESS OF BREATH"].unique()

array([2, 1])

In [135]:
df_train["SWALLOWING DIFFICULTY"].unique()

array([2, 1])

In [136]:
df_train["CHEST PAIN"].unique()

array([1, 2])

In [137]:
df_train["LUNG_CANCER"].unique()

array(['YES', 'NO'], dtype=object)

In [138]:
## Testing

df_test["AGE"].unique()

array([62, 60, 63, 64, 70, 76, 57, 65, 78, 54, 67, 52, 51, 59, 55, 68, 72,
       61, 58, 48, 71, 66, 56, 46, 53, 74, 77, 87])

In [139]:
df_test["SMOKING"].unique()

array([2, 1])

In [140]:
df_test["YELLOW_FINGERS"].unique()

array([2, 1])

In [141]:
df_test["ANXIETY"].unique()

array([2, 1])

In [142]:
df_test["PEER_PRESSURE"].unique()

array([2, 1])

In [143]:
df_test["CHRONIC DISEASE"].unique()

array([2, 1])

In [144]:
df_test = df_test.rename(columns={'FATIGUE ' : 'FATIGUE', 'ALLERGY ' : 'ALLERGY'})

In [145]:
df_test["FATIGUE"].unique()

array([1, 2])

In [146]:
df_test["ALLERGY"].unique()

array([1, 2])

In [147]:
df_test["WHEEZING"].unique()

array([1, 2])

In [148]:
df_test["ALCOHOL CONSUMING"].unique()

array([2, 1])

In [149]:
df_test["COUGHING"].unique()

array([1, 2])

In [150]:
df_test["SHORTNESS OF BREATH"].unique()

array([1, 2])

In [151]:
df_test["SWALLOWING DIFFICULTY"].unique()

array([2, 1])

In [152]:
df_test["CHEST PAIN"].unique()

array([2, 1])

In [153]:
df_test["LUNG_CANCER"].unique()

array(['YES', 'NO'], dtype=object)

In [154]:
df_test["LUNG_CANCER"].value_counts() # the classes are heavily imbalanced (70 YES vs. 8 NO), so the training data should be rebalanced, e.g. with SMOTE

YES    70
NO      8
Name: LUNG_CANCER, dtype: int64
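
The per-column `unique()` calls in this section can be condensed into a loop, and the two `rename` calls suggest that some column names carry trailing spaces. The sketch below shows one possible way to do both on a small hypothetical frame; it is not the notebook's own code.

```python
import pandas as pd

# Small hypothetical frame that mimics trailing spaces in column names.
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "FATIGUE ": [2, 1, 2],
    "ALLERGY ": [1, 1, 2],
    "LUNG_CANCER": ["YES", "NO", "YES"],
})

# Strip stray whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()

# Collect the sorted unique values of each column in a dictionary.
unique_values = {col: sorted(toy[col].unique().tolist()) for col in toy.columns}
for col, values in unique_values.items():
    print(col, values)

# Class balance of the target, expressed as proportions.
print(toy["LUNG_CANCER"].value_counts(normalize=True))
```

Cleaning the column names once, before the train/test split, would avoid repeating the rename for each split.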

## Data Processing and Feature Engineering

In [155]:
df_train.isnull().sum() # the dataset has no missing values

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64
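
Since this dataset has no missing values, the `dropna` step from the task list has nothing to remove. For reference, the following sketch shows what the step would look like on a hypothetical frame that does contain gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete rows.
toy = pd.DataFrame({
    "AGE": [69, np.nan, 59, 63],
    "SMOKING": [1, 2, np.nan, 2],
    "LUNG_CANCER": ["YES", "YES", "NO", "NO"],
})

# Count missing values per column, as done above.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
toy_clean = toy.dropna()
print(toy_clean.shape)
```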

## Separating the Features and the Target Label

In [156]:
x_train= df_train.drop(["LUNG_CANCER"], axis=1)
y_train= df_train["LUNG_CANCER"]

x_test= df_test.drop(["LUNG_CANCER"], axis=1)
y_test= df_test["LUNG_CANCER"]




print("x_train size:",x_train.shape)
print("x_test size:",x_test.shape)
print("y_train size:",y_train.shape)
print("y_test size:",y_test.shape)

x_train size: (231, 15)
x_test size: (78, 15)
y_train size: (231,)
y_test size: (78,)


## Encoding Categorical Attributes

In [157]:
# Applying the one-hot encoder
# Note: every column of x_train, including AGE, is treated as categorical here

man= sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore") # creating the encoder object
man.fit(x_train) # fitting the encoder on the training data


#Encoding the categorical attributes of training Data
x_train= man.transform(x_train)

#Encoding the categorical attributes of test data
x_test= man.transform(x_test)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (231, 65)
x_test: (78, 65)
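
Because the encoder is fitted on all of `x_train`, the numeric AGE column is also expanded into one column per distinct value, which is how 15 features become 65. Ages that appear only in the test set (46 and 87 in the exploration output) are then encoded as all zeros. An alternative, sketched here under the assumption that GENDER is the only truly categorical feature, is to encode that column alone with `ColumnTransformer` and pass the rest through. The frames below are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({
    "GENDER": ["M", "F", "M", "F"],
    "AGE": [69, 74, 59, 63],
    "SMOKING": [1, 2, 1, 2],
})
toy_test = pd.DataFrame({
    "GENDER": ["F", "M"],
    "AGE": [87, 46],  # ages not seen during fitting
    "SMOKING": [2, 1],
})

encoder = ColumnTransformer(
    transformers=[("gender", OneHotEncoder(handle_unknown="ignore"), ["GENDER"])],
    remainder="passthrough",  # keep AGE and SMOKING as numeric columns
    sparse_threshold=0,       # always return a dense array
)

# Fit on the training data only, then apply the same mapping to the test data.
encoded_train = encoder.fit_transform(toy_train)
encoded_test = encoder.transform(toy_test)

print(encoded_train.shape)
print(encoded_test)
```

With this layout, an unseen age keeps its numeric value instead of vanishing from the feature vector.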


## Standardization

In [158]:
# Building a standardization model
scaler= sklearn.preprocessing.StandardScaler(with_mean=False)
scaler.fit(x_train)

# Scaling the training features
x_train= scaler.transform(x_train)

# Scaling the test features
x_test= scaler.transform(x_test)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (231, 65)
x_test: (78, 65)
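
The scaler is created with `with_mean=False` because `OneHotEncoder` returns a sparse matrix by default, and subtracting the mean would turn every zero into a non-zero value. `StandardScaler` refuses to centre sparse input, so only the division by the standard deviation is applied. The small matrix below is hypothetical and only illustrates that behaviour.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

dense = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
sparse_features = sparse.csr_matrix(dense)

# Centring a sparse matrix is rejected with a ValueError.
try:
    StandardScaler().fit(sparse_features)
    centering_failed = False
except ValueError:
    centering_failed = True
print("centering failed:", centering_failed)

# Scaling without centring works and keeps the result sparse.
scaler = StandardScaler(with_mean=False)
scaled = scaler.fit_transform(sparse_features)
print(scaled.toarray())
```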


## Imbalanced Classes

In [159]:
sm = imblearn.over_sampling.SMOTE(random_state=42) # note: do not fix random_state in the final assignment
x_train_resampled, y_train_resampled = sm.fit_resample(x_train, y_train)
print("x_train_resampled size", x_train_resampled.shape)
print("y_train_resampled size", y_train_resampled.shape)

x_train_resampled size (400, 65)
y_train_resampled size (400,)


## Choosing the Algorithms

In [160]:
model_svm= sklearn.svm.SVC()
model_decision_trees= sklearn.tree.DecisionTreeClassifier()

## Defining a Grid

In [170]:
parameters_svm= {'kernel':('linear', 'rbf'), 'C':[1, 10]}
model_svm=sklearn.svm.SVC()
clf_svm= sklearn.model_selection.GridSearchCV(model_svm,parameters_svm, cv=5)
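# Note: the search is fitted on the original training data, not on x_train_resampled / y_train_resampled
clf_svm.fit(x_train, y_train)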
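
After fitting, a `GridSearchCV` object exposes the outcome of the search through attributes such as `best_params_`, `best_score_`, `cv_results_`, and `best_estimator_`. The sketch below uses the same grid as above on synthetic data, so the printed values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data as a stand-in for the real features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Best hyperparameter combination and its mean cross-validated accuracy.
print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)

# One entry per candidate: 2 kernels x 2 values of C.
print("candidates tried:", len(search.cv_results_["params"]))

# The refitted best model is a regular SVC with its usual attributes.
print("support vectors per class:", search.best_estimator_.n_support_)
```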

In [171]:
y_preds = clf_svm.predict(x_test)

In [204]:

accuracy = sklearn.metrics.accuracy_score(y_test, y_preds) # argument order is (y_true, y_pred)

In [173]:
print(accuracy)

0.9230769230769231


In [174]:
parameters_decision_trees= {'criterion':('entropy','gini'), 'max_depth':[2,4]}
model_decision_trees= sklearn.tree.DecisionTreeClassifier()
clf_decision_trees=sklearn.model_selection.GridSearchCV(model_decision_trees,parameters_decision_trees, scoring='accuracy', cv= 5, n_jobs=-1)
clf_decision_trees.fit(x_train, y_train)
y_preds_decision_trees= clf_decision_trees.predict(x_test)

In [175]:
y_preds= y_preds_decision_trees # reuse the predictions computed in the previous cell

In [177]:

accuracy = sklearn.metrics.accuracy_score(y_test, y_preds) # argument order is (y_true, y_pred)
print(accuracy)

0.8846153846153846


In [199]:
parameters_logistic_regression={ 'C' :[1.0]}
model_logistic_regression= sklearn.linear_model.LogisticRegression()
clf_logistic_regression=sklearn.model_selection.GridSearchCV(model_logistic_regression,parameters_logistic_regression, scoring='accuracy', cv= 5, n_jobs=-1)


In [202]:
clf_logistic_regression.fit(x_train, y_train)

In [201]:
y_preds= clf_logistic_regression.predict(x_test)

In [203]:
accuracy = sklearn.metrics.accuracy_score(y_test, y_preds) # argument order is (y_true, y_pred)
print(accuracy)

0.9230769230769231
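
Accuracy alone can be misleading on this test set. With 70 YES and 8 NO samples, a baseline that always predicts YES already reaches 70/78, about 0.897. That is above the decision tree's 0.885 and only slightly below the 0.923 of the SVM and logistic regression models. The sketch below rebuilds that baseline from the class counts reported earlier and adds balanced accuracy and a confusion matrix. The choice of these extra metrics is a suggestion, not part of the original notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
)

# Test-set class counts reported earlier: 70 YES, 8 NO.
y_true = ["YES"] * 70 + ["NO"] * 8

# Majority-class baseline: always predict YES.
y_baseline = ["YES"] * 78

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_balanced = balanced_accuracy_score(y_true, y_baseline)

# Rows are true classes, columns are predicted classes, in the order NO, YES.
matrix = confusion_matrix(y_true, y_baseline, labels=["NO", "YES"])

print("baseline accuracy:", baseline_accuracy)
print("baseline balanced accuracy:", baseline_balanced)
print(matrix)
```

Balanced accuracy averages the recall of each class, so the baseline scores only 0.5 even though its plain accuracy looks high.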
