
# Data Preprocessing and Feature Engineering
You should build a machine learning pipeline with a data preprocessing and feature engineering step. In particular, you should do the following:
- Load the `adult` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Conduct data preprocessing and feature engineering by 
    - removing missing values using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html);
    - encoding categorical attributes using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html);
    - normalizing/scaling features using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html);
    - handling imbalanced classes using [Imbalanced-Learn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html);
    - and reducing the dimensionality of the dataset using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Evaluate the impact of the data preprocessing and feature engineering techniques on the effectiveness and efficiency of the model.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.

## Importing Libraries

In [1]:
## Import the libraries used throughout this notebook


import pandas as pd 
import sklearn.model_selection
import sklearn.preprocessing
import imblearn.over_sampling
import sklearn.svm
import sklearn.tree
import sklearn.neighbors
import sklearn.linear_model
import sklearn.metrics


## Loading the Data Sets

In [2]:
## The following dataset is about lung cancer. Its attributes are listed below.


## Link to the dataset: https://www.kaggle.com/datasets/mysarahmadbhat/lung-cancer



## Gender: M(male), F(female)

## Age: Age of the patient

## Smoking: YES=2 , NO=1.

## Yellow fingers: YES=2 , NO=1.

## Anxiety: YES=2 , NO=1.

## Peer_pressure: YES=2 , NO=1.

## Chronic Disease: YES=2 , NO=1.

## Fatigue: YES=2 , NO=1.

## Allergy: YES=2 , NO=1.

## Wheezing: YES=2 , NO=1.

## Alcohol: YES=2 , NO=1.

## Coughing: YES=2 , NO=1.

## Shortness of Breath: YES=2 , NO=1.

## Swallowing Difficulty: YES=2 , NO=1.

## Chest pain: YES=2 , NO=1.

## Lung Cancer: YES , NO.



df=pd.read_csv("slc.csv")
df.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


## Splitting The Data Sets into Training and Testing

In [3]:
# Split the dataset into training and test sets: 231 rows are used for training and 78 for testing



df_train, df_test= sklearn.model_selection.train_test_split(df, random_state=20)


print("df_train size:", df_train.shape)
print("df_test size:", df_test.shape) 

df_train size: (231, 16)
df_test size: (78, 16)
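
When the classes are imbalanced, a plain random split can leave the training and test sets with different class ratios. The following is a minimal sketch, assuming a made-up toy frame rather than the real dataset, of passing `stratify` to `train_test_split` so that both splits keep the original ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "YES" rows and 20 "NO" rows.
toy = pd.DataFrame({
    "AGE": list(range(100)),
    "LUNG_CANCER": ["YES"] * 80 + ["NO"] * 20,
})

# stratify keeps the YES/NO ratio the same in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=20, stratify=toy["LUNG_CANCER"]
)

train_ratio = toy_train["LUNG_CANCER"].value_counts(normalize=True)
test_ratio = toy_test["LUNG_CANCER"].value_counts(normalize=True)
print(train_ratio)
print(test_ratio)
```

Whether stratification changes the results on the real data has not been checked here.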


## Data Exploration


In [4]:
## Training
df_train["AGE"].unique()


array([70, 58, 71, 64, 39, 72, 62, 60, 47, 77, 51, 52, 59, 53, 49, 68, 69,
       61, 67, 48, 55, 75, 63, 74, 56, 65, 76, 66, 57, 44, 81, 73, 78, 54,
       79, 21, 46])

In [5]:
df_train["SMOKING"].unique()

array([2, 1])

In [6]:
df_train["YELLOW_FINGERS"].unique()

array([1, 2])

In [7]:
df_train["ANXIETY"].unique()

array([1, 2])

In [8]:
df_train["PEER_PRESSURE"].unique()

array([1, 2])

In [9]:
df_train["CHRONIC DISEASE"].unique()

array([1, 2])

In [10]:
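## The "FATIGUE " and "ALLERGY " column names have a trailing space in the dataset, so they are renamed here.
df_train = df_train.rename(columns={'FATIGUE ' : 'FATIGUE', 'ALLERGY ' : 'ALLERGY'})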

In [11]:
df_train["FATIGUE"].unique()

array([2, 1])

In [12]:
df_train["ALLERGY"].unique()

array([1, 2])

In [13]:
df_train["WHEEZING"].unique()

array([1, 2])

In [14]:
df_train["ALCOHOL CONSUMING"].unique()

array([1, 2])

In [15]:
df_train["COUGHING"].unique()

array([1, 2])

In [16]:
df_train["SHORTNESS OF BREATH"].unique()

array([2, 1])

In [17]:
df_train["SWALLOWING DIFFICULTY"].unique()

array([1, 2])

In [18]:
df_train["CHEST PAIN"].unique()

array([1, 2])

In [19]:
df_train["LUNG_CANCER"].unique()

array(['NO', 'YES'], dtype=object)

In [20]:
## Testing

df_test["AGE"].unique()

array([70, 56, 63, 51, 68, 54, 61, 67, 55, 69, 53, 73, 64, 59, 47, 57, 74,
       77, 58, 49, 44, 76, 60, 75, 81, 62, 72, 87, 38])

In [21]:
df_test["SMOKING"].unique()

array([1, 2])

In [22]:
df_test["YELLOW_FINGERS"].unique()

array([2, 1])

In [23]:
df_test["ANXIETY"].unique()

array([1, 2])

In [24]:
df_test["PEER_PRESSURE"].unique()

array([2, 1])

In [25]:
df_test["CHRONIC DISEASE"].unique()

array([1, 2])

In [26]:
## The "FATIGUE " and "ALLERGY " column names have a trailing space in the dataset, so they are renamed in the test set as well.


df_test = df_test.rename(columns={'FATIGUE ' : 'FATIGUE', 'ALLERGY ' : 'ALLERGY'}) 
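
An alternative to renaming the columns in each split separately is to strip whitespace from all column names once, right after loading. A minimal sketch on a hypothetical toy frame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame with trailing spaces in two column names.
toy = pd.DataFrame({"FATIGUE ": [2, 1], "ALLERGY ": [1, 2], "WHEEZING": [2, 2]})

# Strip leading and trailing whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```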

In [27]:
df_test["FATIGUE"].unique()

array([2, 1])

In [28]:
df_test["ALLERGY"].unique()

array([2, 1])

In [29]:
df_test["WHEEZING"].unique()

array([2, 1])

In [30]:
df_test["ALCOHOL CONSUMING"].unique()

array([2, 1])

In [31]:
df_test["COUGHING"].unique()

array([2, 1])

In [32]:
df_test["SHORTNESS OF BREATH"].unique()

array([1, 2])

In [33]:
df_test["SWALLOWING DIFFICULTY"].unique()

array([1, 2])

In [34]:
df_test["CHEST PAIN"].unique()

array([1, 2])

In [35]:
df_test["LUNG_CANCER"].unique()

array(['YES', 'NO'], dtype=object)
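
The per-column checks above can be condensed into a single loop. A minimal sketch on a hypothetical toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with a few of the dataset's column names.
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "SMOKING": [1, 2, 2],
    "LUNG_CANCER": ["YES", "NO", "YES"],
})

# Collect the sorted unique values of every column in one pass.
unique_values = {col: sorted(toy[col].unique()) for col in toy.columns}
for col, values in unique_values.items():
    print(col, values)
```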

In [36]:
df_test["LUNG_CANCER"].value_counts() # The classes are heavily imbalanced (64 YES vs. 14 NO), so SMOTE is used to balance the training data

YES    64
NO     14
Name: LUNG_CANCER, dtype: int64

## Data Processing and Feature Engineering

In [37]:
## The training set has no missing values


df_train.isnull().sum() 

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

## Separating the Features and the Target Label

In [38]:
x_train= df_train.drop(["LUNG_CANCER"], axis=1)
y_train= df_train["LUNG_CANCER"]

x_test= df_test.drop(["LUNG_CANCER"], axis=1)
y_test= df_test["LUNG_CANCER"]




print("x_train size:",x_train.shape)
print("x_test size:",x_test.shape)
print("y_train size:",y_train.shape)
print("y_test size:",y_test.shape)

x_train size: (231, 15)
x_test size: (78, 15)
y_train size: (231,)
y_test size: (78,)


## Encoding Categorical Attributes

In [39]:
# Applying the one-hot encoder

man= sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore") ## creating the encoder object
man.fit(x_train) ## fitting the encoder on the training data


# Encoding the categorical attributes of the training data
x_train= man.transform(x_train)

# Encoding the categorical attributes of the test data
x_test= man.transform(x_test)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (231, 65)
x_test: (78, 65)
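
The width of 65 comes from one-hot encoding every column, including the numeric `AGE`: 2 (GENDER) + 37 (distinct training ages) + 13 × 2 (binary attributes) = 65. With `handle_unknown="ignore"`, test ages that never occur in the training set (for example 87 and 38) are encoded as all zeros. The following is a sketch only, assuming one would prefer to encode just `GENDER` and pass the numeric columns through unchanged; the toy frames are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames with three of the dataset's columns.
toy_train = pd.DataFrame({"GENDER": ["M", "F", "M"], "AGE": [69, 59, 63], "SMOKING": [1, 2, 1]})
toy_test = pd.DataFrame({"GENDER": ["F"], "AGE": [87], "SMOKING": [2]})

# One-hot encode GENDER only; all other columns are passed through as they are.
encoder = ColumnTransformer(
    transformers=[("gender", OneHotEncoder(handle_unknown="ignore"), ["GENDER"])],
    remainder="passthrough",
)
encoded_train = encoder.fit_transform(toy_train)  # fit on the training data only
encoded_test = encoder.transform(toy_test)        # reuse the fitted categories

print(encoded_train.shape, encoded_test.shape)
```

Here `GENDER` becomes two columns and `AGE` stays a single numeric column, so an unseen age keeps its value instead of disappearing.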


## Standardization

In [40]:
# Building a standardization model
scaler= sklearn.preprocessing.StandardScaler(with_mean=False)
scaler.fit(x_train)

# Scaling the training features
x_train= scaler.transform(x_train)

# Scaling the test features
x_test= scaler.transform(x_test)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)

x_train: (231, 65)
x_test: (78, 65)
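
`with_mean=False` is used because the one-hot encoder returns a sparse matrix, and centering a sparse matrix would make it dense (scikit-learn raises an error if centering is requested on sparse input). The following minimal sketch, on hypothetical toy matrices, shows that the scaler then only divides each column by its training standard deviation and reuses those statistics on the test data:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse training and test matrices.
toy_train = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
toy_test = sparse.csr_matrix(np.array([[0.0, 1.0]]))

scaler = StandardScaler(with_mean=False)
scaled_train = scaler.fit_transform(toy_train)  # statistics come from the training data only
scaled_test = scaler.transform(toy_test)        # the same statistics are reused here

# Each training column now has unit standard deviation; zeros stay zeros.
train_std = scaled_train.toarray().std(axis=0)
print(train_std)
print(scaled_test.toarray())
```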


## Handling Imbalanced Classes

In [41]:
sm = imblearn.over_sampling.SMOTE(random_state=20) 
x_train_resampled, y_train_resampled = sm.fit_resample(x_train, y_train)
print("x_train_resampled size", x_train_resampled.shape)
print("y_train_resampled size", y_train_resampled.shape)

x_train_resampled size (412, 65)
y_train_resampled size (412,)
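
The resampled size of 412 rows means that both classes now have 206 samples each. SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors. The block below is a simplified illustration of that interpolation step in NumPy, not the Imbalanced-Learn implementation; the toy points are hypothetical:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(20)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])

# For each point, find its nearest minority neighbor
# (column 0 of the result is the point itself, column 1 the neighbor).
nn = NearestNeighbors(n_neighbors=2).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)

synthetic = []
for i, point in enumerate(minority):
    neighbor = minority[neighbor_idx[i, 1]]
    gap = rng.random()  # random position along the segment, in [0, 1)
    synthetic.append(point + gap * (neighbor - point))
synthetic = np.array(synthetic)

print(synthetic)
```

Each synthetic point lies on the line segment between an existing minority sample and its neighbor, so no sample is simply duplicated.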


## Choosing the Algorithms

In [42]:
## Four different algorithms are compared: SVM, decision trees, logistic regression, and KNN


model_svm= sklearn.svm.SVC()
model_decision_trees= sklearn.tree.DecisionTreeClassifier()
model_logistic_regression= sklearn.linear_model.LogisticRegression()
model_KNN=sklearn.neighbors.KNeighborsClassifier()

## Defining a Grid

In [43]:
## Defining the SVM hyperparameter grid and running a grid search
## Note: the search is fitted on x_train and y_train, not on the SMOTE-resampled data


parameters_svm= {'kernel':( 'linear', 'rbf') , 'C':[0.1,1,2]}
model_svm=sklearn.svm.SVC()
clf_svm= sklearn.model_selection.GridSearchCV(model_svm,parameters_svm,  scoring='accuracy', cv= 5, n_jobs=-1)
clf_svm.fit(x_train, y_train)

In [44]:
y_preds = clf_svm.predict(x_test)

In [45]:

accuracy = sklearn.metrics.accuracy_score(y_test, y_preds)  # argument order is (y_true, y_pred)

In [46]:
print(accuracy)

0.8205128205128205
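
After a grid search, the fitted object exposes which combination won and how it scored in cross-validation through `best_params_`, `best_score_`, and `cv_results_`. A minimal sketch on synthetic data from `make_classification` (an assumption made only so that the example is self-contained):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, not the lung cancer dataset.
X, y = make_classification(n_samples=120, n_features=8, random_state=20)

grid = {"kernel": ("linear", "rbf"), "C": [0.1, 1, 2]}
search = GridSearchCV(SVC(), grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_)                   # winning combination
print(round(search.best_score_, 3))          # mean cross-validated accuracy of the winner
print(len(search.cv_results_["params"]))     # number of combinations tried
```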


In [47]:
## Defining the decision tree hyperparameter grid and running a grid search


parameters_decision_trees= {'criterion':('entropy','gini'), 'max_depth':[2,4,7]}
model_decision_trees= sklearn.tree.DecisionTreeClassifier()
clf_decision_trees=sklearn.model_selection.GridSearchCV(model_decision_trees,parameters_decision_trees, scoring='accuracy', cv= 5, n_jobs=-1)
clf_decision_trees.fit(x_train, y_train)
y_preds_decision_trees= clf_decision_trees.predict(x_test)

In [48]:
y_preds= clf_decision_trees.predict(x_test)

In [49]:

accuracy = sklearn.metrics.accuracy_score(y_test, y_preds)  # argument order is (y_true, y_pred)
print(accuracy)

0.8461538461538461


In [50]:
## Defining the logistic regression hyperparameter grid and setting up a grid search


parameters_logistic_regression={ 'C' :[0.1,1.0,3], 'random_state' : [20]}

model_logistic_regression= sklearn.linear_model.LogisticRegression()

clf_logistic_regression=sklearn.model_selection.GridSearchCV(model_logistic_regression,parameters_logistic_regression, scoring='accuracy', cv= 5, n_jobs=-1)


In [51]:
clf_logistic_regression.fit(x_train, y_train)

In [52]:
y_preds= clf_logistic_regression.predict(x_test)

In [53]:
accuracy = sklearn.metrics.accuracy_score(y_test, y_preds)  # argument order is (y_true, y_pred)
print(accuracy)

0.9102564102564102


In [54]:
## Defining the KNN hyperparameter grid and running a grid search


parameters_KNN= {'weights':['uniform'], 'n_neighbors':[2,5]}
model_KNN= sklearn.neighbors.KNeighborsClassifier()
clf_KNN=sklearn.model_selection.GridSearchCV(model_KNN,parameters_KNN, scoring='accuracy', cv= 5, n_jobs=-1)
clf_KNN.fit(x_train, y_train)
y_preds_KNN= clf_KNN.predict(x_test)

In [55]:
y_preds= clf_KNN.predict(x_test)

In [56]:
accuracy = sklearn.metrics.accuracy_score(y_test, y_preds)  # argument order is (y_true, y_pred)
print(accuracy)

0.8461538461538461


In [57]:
## After trying the four algorithms, logistic regression achieved the highest test accuracy (about 0.91).
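
Because the test set holds 64 YES and 14 NO labels, a model that always predicts YES already reaches 64/78 ≈ 0.82 accuracy, which is the same value reported for the SVM above. Accuracy alone may therefore overstate how well a model separates the classes. A minimal sketch, assuming labels with that 64/14 split and placeholder features, that compares accuracy with balanced accuracy for such a baseline:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels with the same class counts as the test set; the features are placeholders.
y_true = np.array(["YES"] * 64 + ["NO"] * 14)
X_placeholder = np.zeros((len(y_true), 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_pred = baseline.predict(X_placeholder)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of the per-class recalls
print(acc, bal_acc)
```

The baseline has high accuracy but a balanced accuracy of only 0.5, so reporting a class-aware metric next to accuracy would make the comparison between the four models more informative.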

## Logistic Regression: Tuning the Hyperparameters

In [58]:
parameters_logistic_regression={ 'C' :[0.1,1.0,3] ,'random_state' : [20]}

model_logistic_regression= sklearn.linear_model.LogisticRegression()

clf_logistic_regression=sklearn.model_selection.GridSearchCV(model_logistic_regression,parameters_logistic_regression, scoring='accuracy', cv= 5, n_jobs=-1)

clf_logistic_regression.fit(x_train, y_train)
y_pred_final= clf_logistic_regression.predict(x_test)
print("The best hyperparameters of logistic regression algorithm= {}".format(clf_logistic_regression.best_params_))
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred_final)  # argument order is (y_true, y_pred)
print(accuracy)

The best hyperparameters of logistic regression algorithm= {'C': 1.0, 'random_state': 20}
0.9102564102564102


In [59]:
parameters_logistic_regression={ 'C' :[0.1,0.18, 0.2,0.5], 'random_state' : [20]}

model_logistic_regression= sklearn.linear_model.LogisticRegression()

clf_logistic_regression=sklearn.model_selection.GridSearchCV(model_logistic_regression,parameters_logistic_regression, scoring='accuracy', cv= 5, n_jobs=-1)

clf_logistic_regression.fit(x_train, y_train)
y_pred_final= clf_logistic_regression.predict(x_test)

print("The best hyperparameters of logistic regression algorithm= {}".format(clf_logistic_regression.best_params_))

accuracy = sklearn.metrics.accuracy_score(y_test, y_pred_final)  # argument order is (y_true, y_pred)
print(accuracy)


The best hyperparameters of logistic regression algorithm= {'C': 0.5, 'random_state': 20}
0.9230769230769231


In [60]:
parameters_logistic_regression={ 'C' :[ 0.5,0.7, 0.8,0.9, 1, 1.3], 'random_state' : [20],'multi_class':['auto','ovr','multinomial'] }

model_logistic_regression= sklearn.linear_model.LogisticRegression()

clf_logistic_regression=sklearn.model_selection.GridSearchCV(model_logistic_regression,parameters_logistic_regression, scoring='accuracy', cv= 5, n_jobs=-1)

clf_logistic_regression.fit(x_train, y_train)
y_pred_final= clf_logistic_regression.predict(x_test)

print("The best hyperparameters of logistic regression algorithm= {}".format(clf_logistic_regression.best_params_))

accuracy = sklearn.metrics.accuracy_score(y_test, y_pred_final)  # argument order is (y_true, y_pred)
print(accuracy)


The best hyperparameters of logistic regression algorithm= {'C': 0.5, 'multi_class': 'multinomial', 'random_state': 20}
0.9102564102564102
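
The three searches above are compared by their test-set accuracy, which risks tuning the grid to the test set. The following is a sketch only, assuming the cross-validated score is used for model selection instead: scaling and the classifier are wrapped in a `Pipeline`, so a single `GridSearchCV` tunes `C` while the preprocessing is refit inside each fold. The data is synthetic, and `max_iter=1000` is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the lung cancer dataset).
X, y = make_classification(n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=20)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=20)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=20)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid = {"clf__C": [0.1, 0.5, 1.0, 3.0]}
search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_           # used to choose the hyperparameters
test_accuracy = search.score(X_te, y_te)   # reported once, after selection
print(search.best_params_, cv_accuracy, test_accuracy)
```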
