### What You're Aiming For

1. Dataset Selection:
    - Head over to Kaggle and choose a dataset that aligns with your interests. Ensure it involves either a classification or regression task.
    - Real-world data can be messy, and that's perfectly fine! You're here to tame it.
2. Data Preprocessing:
    - Identify and handle missing values within your dataset. Employ effective strategies for dealing with missing data.
    - Implement data cleaning, formatting, and organization to prepare your dataset for training.
3. Feature Engineering:
    - Enhance your model's performance by creating new features or transforming existing ones.
    - Tailor your feature engineering techniques to address the specific needs of your chosen project.
4. Data Visualization:
    - Utilize data visualization techniques to gain insights into your dataset.
    - Create visualizations that reveal patterns and relationships, aiding your understanding of the data.
5. Model Selection:
    - Choose the right model based on the nature of your problem.
    - Consider factors such as the task type (classification or regression), dataset size, and alignment with algorithm assumptions.
6. Model Evaluation:
    - Evaluate your model's performance using appropriate metrics for the chosen task (e.g. accuracy, classification reports, confusion matrices).
    - Use appropriate methods: hyperparameter tuning, cross-validation, etc.
    - Justify your model selection and discuss the implications of your results.
7. Project Submission:
    - Share your code and findings in the assignment section. Provide clear documentation and explanations.
    - Highlight any challenges faced during the project and how you overcame them.

### Instructions

- What dataset did you use?
- Explain your project! I want to see the thinking behind the code, e.g. why this project interested you, the steps you took while doing the project, and how you decided what model to use.



In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [7]:
df = pd.read_csv("Customer-Churn-Records.csv")

In [9]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1,1,2,DIAMOND,464
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0,1,3,DIAMOND,456
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1,1,3,DIAMOND,377
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0,0,5,GOLD,350
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0,0,5,GOLD,425


In [11]:
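# Drop identifier columns that carry no predictive signal.
# `columns=` already names the axis, so `axis=1` is redundant.
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])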

In [13]:
le = LabelEncoder()

In [15]:
for col in ['Geography', 'Gender', 'Card Type']:
    if col in df.columns:
        df[col] = le.fit_transform(df[col])
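
`LabelEncoder` maps each category to an integer in sorted order (the output below shows France as 0 and Spain as 2), which implies an ordering that nominal columns such as `Geography` do not have. Tree-based models usually tolerate this, but one-hot encoding is a common alternative. The following is a minimal sketch on a made-up toy frame that only borrows the notebook's column names; it is not the approach used in this notebook.

```python
import pandas as pd

# Toy frame using the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Card Type": ["DIAMOND", "GOLD", "SILVER", "PLATINUM"],
    "Age": [42, 41, 39, 43],
})

# One indicator column per category, so no artificial ordering is implied.
encoded = pd.get_dummies(
    toy, columns=["Geography", "Gender", "Card Type"], dtype=int
)

print(encoded.columns.tolist())
```

The original categorical columns are replaced by indicator columns such as `Geography_France`, while `Age` is left untouched.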

In [17]:
df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Complain,Satisfaction Score,Card Type,Point Earned
0,619,0,0,42,2,0.0,1,1,1,101348.88,1,1,2,0,464
1,608,2,0,41,1,83807.86,1,0,1,112542.58,0,1,3,0,456
2,502,0,0,42,8,159660.8,3,1,0,113931.57,1,1,3,0,377
3,699,0,0,39,1,0.0,2,0,0,93826.63,0,0,5,1,350
4,850,2,0,43,2,125510.82,1,1,1,79084.1,0,0,5,1,425


In [19]:
# Separate the features from the target column.
X = df.drop(columns='Exited')
y = df['Exited']

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
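
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class ratios. Passing `stratify` to `train_test_split` keeps the ratio the same in both. This is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are stand-ins, not the churn data), shown as an option rather than as what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1000 rows, 20% positive class.
rng = np.random.default_rng(30)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo preserves the 20/80 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```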

In [31]:
smote = SMOTE()
X_train_smote, y_tran_smote = smote.fit_resample(X_train, y_train)

In [33]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=30)

In [35]:
pipeline = Pipeline([
    ('ss', StandardScaler()),
    ('rfc', RandomForestClassifier())
])

In [37]:
params = {
    'rfc__n_estimators': range(20, 100, 20),
    'rfc__max_depth': range(10, 50, 10)
}

In [39]:
model = GridSearchCV(
    pipeline,
    param_grid=params,
    cv=cv,
    n_jobs=5,
    verbose=1)

In [41]:
model.fit(X_train_smote, y_tran_smote)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


In [47]:
model.score(X_train_smote, y_tran_smote)

0.9990622069396686
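
The score above is computed on the same resampled data the model was fitted on, so it says little about generalization. A fitted `GridSearchCV` also exposes `best_params_` and `best_score_` (the mean score on the held-out folds), which are usually more informative. The sketch below uses synthetic data from `make_classification` and a smaller, hypothetical grid, purely to show where these attributes live.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data standing in for the churn features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=30
)

search = GridSearchCV(
    RandomForestClassifier(random_state=30),
    param_grid={"n_estimators": [20, 40], "max_depth": [5, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=30),
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("mean CV score of best candidate:", search.best_score_)
print("score on the fitting data:", search.score(X_demo, y_demo))
```

In the notebook, the same attributes would be read from `model` after `model.fit(...)`.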

In [43]:
y_pred = model.predict(X_test)

In [45]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1564
           1       1.00      1.00      1.00       436

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000
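
A score of 1.00 on every metric is unusual for churn data, so it is worth checking whether some feature nearly duplicates the target (target leakage). One quick check is the absolute correlation of each feature with the label. The sketch below builds a synthetic frame with a deliberately leaky column (the names `noise`, `leaky`, and `target` are hypothetical); whether any column in this dataset behaves that way is something the same check on `df` and `Exited` would have to show.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(30)
n = 1000
target = rng.integers(0, 2, size=n)

demo = pd.DataFrame({
    "noise": rng.normal(size=n),   # unrelated to the target
    "leaky": target.copy(),        # starts as a copy of the target
    "target": target,
})

# Flip 1% of the leaky column so it is close to, but not exactly, the label.
flip = rng.choice(n, size=10, replace=False)
demo.loc[flip, "leaky"] = 1 - demo.loc[flip, "leaky"]

# Absolute correlation of every feature with the target, largest first.
corr = (
    demo.drop(columns="target")
    .corrwith(demo["target"])
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```

A feature whose correlation with the target is close to 1 deserves a closer look before the results are trusted; retraining without it shows how much of the score it was responsible for.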



In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   CreditScore         10000 non-null  int64  
 1   Geography           10000 non-null  int64  
 2   Gender              10000 non-null  int64  
 3   Age                 10000 non-null  int64  
 4   Tenure              10000 non-null  int64  
 5   Balance             10000 non-null  float64
 6   NumOfProducts       10000 non-null  int64  
 7   HasCrCard           10000 non-null  int64  
 8   IsActiveMember      10000 non-null  int64  
 9   EstimatedSalary     10000 non-null  float64
 10  Exited              10000 non-null  int64  
 11  Complain            10000 non-null  int64  
 12  Satisfaction Score  10000 non-null  int64  
 13  Card Type           10000 non-null  int64  
 14  Point Earned        10000 non-null  int64  
dtypes: float64(2), int64(13)
memory usage: 1.1 MB


I chose this project because I am interested in financial datasets and I love solving financial problems.

In this code I first dropped the columns that were not needed to build the model, then encoded the categorical columns using a label encoder. I then separated the data into features and target, split it into training and testing sets, and used SMOTE on the training set to handle the class imbalance. I used a pipeline containing a standard scaler and a random forest classifier, tuned `n_estimators` and `max_depth` with a grid search over 5-fold stratified cross-validation, and finally trained the model.