# TASK-1: PREDICTIVE MODEL

## OBJECTIVE:
- Preprocess the data (handle categorical variables, feature scaling).
- Train and test multiple classification models (e.g., Decision Trees, Logistic Regression, Random Forest).
- Evaluate models using accuracy, precision, recall, and F1-score.
- Perform hyperparameter tuning using grid search.

Tools: Python, scikit-learn, pandas, matplotlib.

### 1. IMPORTING LIBRARIES

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### 2. READING THE CSV FILE

In [6]:
churn=pd.read_csv(r'churn-bigml-80.csv')
print(churn)

     State  Account length  Area code International plan Voice mail plan  \
0       KS             128        415                 No             Yes   
1       OH             107        415                 No             Yes   
2       NJ             137        415                 No              No   
3       OH              84        408                Yes              No   
4       OK              75        415                Yes              No   
...    ...             ...        ...                ...             ...   
2661    SC              79        415                 No              No   
2662    AZ             192        415                 No             Yes   
2663    WV              68        415                 No              No   
2664    RI              28        510                 No              No   
2665    TN              74        415                 No             Yes   

      Number vmail messages  Total day minutes  Total day calls  \
0                   

### 3. DATA PREPROCESSING

#### a. CHECKING NULL VALUES

In [9]:
churn.isnull().sum()

State                     0
Account length            0
Area code                 0
International plan        0
Voice mail plan           0
Number vmail messages     0
Total day minutes         0
Total day calls           0
Total day charge          0
Total eve minutes         0
Total eve calls           0
Total eve charge          0
Total night minutes       0
Total night calls         0
Total night charge        0
Total intl minutes        0
Total intl calls          0
Total intl charge         0
Customer service calls    0
Churn                     0
dtype: int64

#### b. VERIFYING DATA TYPES

In [11]:
churn.dtypes

State                      object
Account length              int64
Area code                   int64
International plan         object
Voice mail plan            object
Number vmail messages       int64
Total day minutes         float64
Total day calls             int64
Total day charge          float64
Total eve minutes         float64
Total eve calls             int64
Total eve charge          float64
Total night minutes       float64
Total night calls           int64
Total night charge        float64
Total intl minutes        float64
Total intl calls            int64
Total intl charge         float64
Customer service calls      int64
Churn                        bool
dtype: object

#### c. ENCODING CATEGORICAL VALUES

In [13]:
le=LabelEncoder()
churn['Churn']=le.fit_transform(churn['Churn'])

In [14]:
le1=LabelEncoder()
churn['State']=le1.fit_transform(churn['State'])

In [15]:
le2=LabelEncoder()
churn['International plan']=le2.fit_transform(churn['International plan'])

In [16]:
le3=LabelEncoder()
churn['Voice mail plan']=le3.fit_transform(churn['Voice mail plan'])

In [17]:
churn.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,16,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,35,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,31,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,35,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,36,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


#### d. FEATURE SCALING

In [19]:
sc=StandardScaler()

In [20]:
churn_scaled=sc.fit_transform(churn)

In [21]:
df=pd.DataFrame(churn_scaled,columns=churn.columns,index=churn.index)

In [22]:
churn.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,16,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,35,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,31,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,35,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,36,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0
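
Three details of the scaling cells are worth noting. `sc.fit_transform(churn)` scales every column, including the `Churn` target. Cell In [22] prints `churn` rather than `df`, so the table shown is still unscaled. The later cells build `x` and `y` from `churn`, so the scaled frame `df` is never used for training. A minimal sketch of one common arrangement follows: split first, leave the target out, and fit the scaler on the training rows only. The `demo` frame is synthetic and all variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn table; names are borrowed, values are random.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Total day minutes': rng.normal(180, 50, size=200),
    'Customer service calls': rng.integers(0, 8, size=200),
    'Churn': rng.integers(0, 2, size=200),
})

features = demo.drop(columns='Churn')   # the target is left out of scaling
target = demo['Churn']

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only ...
f_train_scaled = pd.DataFrame(scaler.fit_transform(f_train),
                              columns=features.columns, index=f_train.index)
# ... and reuse those statistics for the test rows.
f_test_scaled = pd.DataFrame(scaler.transform(f_test),
                             columns=features.columns, index=f_test.index)

print(f_train_scaled.describe().loc[['mean', 'std']])
```

Fitting on the training rows alone keeps test-set statistics out of the preprocessing step. Tree-based models are largely insensitive to scaling, so the effect would mainly show up for logistic regression.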


### 4. TRAIN TEST SPLIT

In [24]:
x=churn.drop(columns='Churn')
y=churn['Churn']

In [25]:
x.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls
0,16,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,35,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1
2,31,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0
3,35,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2
4,36,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3


In [26]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Churn, dtype: int64

In [27]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
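
The recall values recorded in the evaluation cells are all multiples of 1/79, which appears consistent with 79 churners in a 534-row test set (about 15%). With a minority class of that size, a stratified split keeps the class share equal in both partitions. The sketch below uses made-up labels with a 15% positive share, not the churn data, and only adds the `stratify` argument to the same `train_test_split` call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (15% positives); not the real churn data.
labels = np.array([1] * 150 + [0] * 850)
inputs = np.arange(len(labels)).reshape(-1, 1)

in_train, in_test, lab_train, lab_test = train_test_split(
    inputs, labels, test_size=0.2, random_state=42,
    stratify=labels)  # preserve the class ratio in both partitions

print('positive share, train:', lab_train.mean())
print('positive share, test: ', lab_test.mean())
```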

### 5. MODEL IMPLEMENTATION, HYPERPARAMETER TUNING & EVALUATION

#### DECISION TREE

In [30]:
decision=DecisionTreeClassifier()
decision.fit(x_train,y_train)

In [31]:
y_pred=decision.predict(x_test)

##### EVALUATION

In [33]:
print('Decision Tree Evaluation:\n')
print('Accuracy Score:',accuracy_score(y_test,y_pred))
print('Precision:',precision_score(y_test,y_pred))
print('Recall:',recall_score(y_test,y_pred))
print('F1 Score:',f1_score(y_test,y_pred))

Decision Tree Evaluation:

Accuracy Score: 0.9026217228464419
Precision: 0.6956521739130435
Recall: 0.6075949367088608
F1 Score: 0.6486486486486487
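
The same four metric calls are repeated for every model in this notebook. As a convenience, they could be wrapped in a small helper that also returns the confusion-matrix counts, which show where precision and recall come from. `summarize` is a hypothetical name, and the labels below are a hand-made example rather than model output.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def summarize(y_true, y_hat):
    """Return the four notebook metrics plus the confusion-matrix counts."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return {
        'accuracy': accuracy_score(y_true, y_hat),
        'precision': precision_score(y_true, y_hat, zero_division=0),
        'recall': recall_score(y_true, y_hat, zero_division=0),
        'f1': f1_score(y_true, y_hat, zero_division=0),
        'tn': int(tn), 'fp': int(fp), 'fn': int(fn), 'tp': int(tp),
    }


# Hand-made example: 3 actual positives, 2 of them found, 1 false alarm.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
report = summarize(actual, predicted)
print(report)
```

In the notebook this would be called as `summarize(y_test, y_pred)`, assuming the variables from the cells above.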


##### HYPERPARAMETER TUNING

In [35]:
param_grid_dec={
    'splitter':['best','random'],
    'max_depth':[10,20,None],
    'min_samples_split':[2,3,4,5],
    'min_samples_leaf':[1,2,3]
}
grid_decision=GridSearchCV(DecisionTreeClassifier(),param_grid_dec,cv=3,scoring='accuracy')
grid_decision.fit(x_train,y_train)

In [36]:
print('Best Parameters for Decision Tree:\n',grid_decision.best_params_)

Best Parameters for Decision Tree:
 {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'splitter': 'best'}


##### EVALUATION BASED ON HYPERPARAMETER TUNING

In [38]:
best_decision=grid_decision.best_estimator_

In [39]:
y_pred_best=best_decision.predict(x_test)

In [40]:
print('Decision Tree Evaluation after hyperparameter tuning:\n')
print('Accuracy Score:',accuracy_score(y_test,y_pred_best))
print('Precision:',precision_score(y_test,y_pred_best))
print('Recall:',recall_score(y_test,y_pred_best))
print('F1 Score:',f1_score(y_test,y_pred_best))

Decision Tree Evaluation after hyperparameter tuning:

Accuracy Score: 0.9325842696629213
Precision: 0.9387755102040817
Recall: 0.5822784810126582
F1 Score: 0.71875
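
Tuning raised accuracy and precision, but recall dropped slightly (0.608 to 0.582), which is plausible because the grid search ranked candidates by `accuracy`. If catching churners matters more, the search could rank by F1 instead. The sketch below assumes synthetic data from `make_classification`, a reduced grid, and fixed seeds so repeated runs give the same tree. None of it is the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 15% positives); not the churn file.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

small_grid = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 3]}
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed for repeatable trees
    small_grid,
    cv=folds,
    scoring='f1',  # rank candidates by F1 on the positive class
)
search.fit(X_tr, y_tr)

print('best params:', search.best_params_)
print('mean CV F1 :', search.best_score_)
```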


#### LOGISTIC REGRESSION

In [42]:
x1=churn.drop(columns='Churn')
y1=churn['Churn']

In [43]:
x_train1,x_test1,y_train1,y_test1=train_test_split(x1,y1,test_size=0.2,random_state=42)

In [44]:
lr=LogisticRegression(max_iter=5000)
lr.fit(x_train1,y_train1)

In [45]:
y_pred1=lr.predict(x_test1)

##### EVALUATION

In [47]:
print('Logistic Regression Evaluation:\n')
print('Accuracy Score:',accuracy_score(y_test1,y_pred1))
print('Precision:',precision_score(y_test1,y_pred1))
print('Recall:',recall_score(y_test1,y_pred1))
print('F1 Score:',f1_score(y_test1,y_pred1))

Logistic Regression Evaluation:

Accuracy Score: 0.8632958801498127
Precision: 0.6
Recall: 0.22784810126582278
F1 Score: 0.3302752293577982


##### HYPERPARAMETER TUNING

In [49]:
param_grid_lr={
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':[0.01,0.1,1,10],
}
grid_lr=GridSearchCV(LogisticRegression(max_iter=5000),param_grid_lr,cv=5,scoring='accuracy',n_jobs=-1,verbose=1)
grid_lr.fit(x_train1,y_train1)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


In [50]:
print('Best Parameters for Logistic Regression:\n',grid_lr.best_params_)

Best Parameters for Logistic Regression:
 {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}


In [51]:
best_lr=grid_lr.best_estimator_

In [52]:
y_pred1_best=best_lr.predict(x_test1)

##### EVALUATION BASED ON HYPERPARAMETER TUNING

In [54]:
print('Logistic Regression Evaluation:\n')
print('Accuracy Score:',accuracy_score(y_test1,y_pred1_best))
print('Precision:',precision_score(y_test1,y_pred1_best))
print('Recall:',recall_score(y_test1,y_pred1_best))
print('F1 Score:',f1_score(y_test1,y_pred1_best))

Logistic Regression Evaluation:

Accuracy Score: 0.8595505617977528
Precision: 0.5625
Recall: 0.22784810126582278
F1 Score: 0.32432432432432434
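
Logistic regression is trained here on the unscaled `churn` columns, and regularised linear models are usually sensitive to feature scale, which may contribute to the low recall. A hedged sketch of one remedy follows: put `StandardScaler` and `LogisticRegression` in a `Pipeline` so the scaler is refit inside every cross-validation fold. The data is synthetic, and the step names `scale` and `clf` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data; not the churn file.
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.85, 0.15], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

pipe = Pipeline([
    ('scale', StandardScaler()),  # fitted on the training folds only
    ('clf', LogisticRegression(solver='liblinear', max_iter=5000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
grid = {'clf__penalty': ['l1', 'l2'], 'clf__C': [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, grid, cv=5, scoring='f1')
search.fit(X_tr, y_tr)

test_f1 = search.score(X_te, y_te)  # uses the same F1 scorer as the search
print('best params:', search.best_params_)
print('test F1    :', test_f1)
```

Whether this improves the churn results would need to be checked by rerunning the notebook; the sketch only shows the wiring.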


#### RANDOM FOREST

In [56]:
x2=churn.drop(columns='Churn')
y2=churn['Churn']

In [57]:
x_train2,x_test2,y_train2,y_test2=train_test_split(x2,y2,test_size=0.2,random_state=42)

In [58]:
random=RandomForestClassifier()
random.fit(x_train2,y_train2)

In [59]:
y_pred2=random.predict(x_test2)

##### EVALUATION

In [61]:
print('Random Forest Evaluation:\n')
print('Accuracy Score:',accuracy_score(y_test2,y_pred2))
print('Precision:',precision_score(y_test2,y_pred2))
print('Recall:',recall_score(y_test2,y_pred2))
print('F1 Score:',f1_score(y_test2,y_pred2))

Random Forest Evaluation:

Accuracy Score: 0.951310861423221
Precision: 1.0
Recall: 0.6708860759493671
F1 Score: 0.803030303030303


##### HYPERPARAMETER TUNING

In [63]:
param_grid_rf={
    'n_estimators':[20,50,100],
    'max_depth':[None,5,10,20],
    'min_samples_split':[2,3,5],
    'min_samples_leaf':[2,3,5]
}
grid_rf=GridSearchCV(RandomForestClassifier(),param_grid_rf,cv=5,n_jobs=-1)
grid_rf.fit(x_train2,y_train2)

In [64]:
print('Best Parameters:\n',grid_rf.best_params_)

Best Parameters:
 {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}


In [65]:
best_rf=grid_rf.best_estimator_

In [66]:
y_pred2_best=best_rf.predict(x_test2)

##### EVALUATION BASED ON HYPERPARAMETER TUNING

In [68]:
print('Random Forest Evaluation:\n')
print('Accuracy Score:',accuracy_score(y_test2,y_pred2_best))
print('Precision:',precision_score(y_test2,y_pred2_best))
print('Recall:',recall_score(y_test2,y_pred2_best))
print('F1 Score:',f1_score(y_test2,y_pred2_best))

Random Forest Evaluation:

Accuracy Score: 0.949438202247191
Precision: 0.9814814814814815
Recall: 0.6708860759493671
F1 Score: 0.7969924812030075
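
To compare the six runs side by side, the recorded test-set metrics can be collected into one table. The numbers below are copied from the outputs above and rounded to four decimals; the column names and the `variant` labels are illustrative.

```python
import pandas as pd

# Test-set metrics copied from the recorded outputs (rounded to 4 decimals).
results = pd.DataFrame(
    [
        ('Decision Tree', 'default', 0.9026, 0.6957, 0.6076, 0.6486),
        ('Decision Tree', 'tuned', 0.9326, 0.9388, 0.5823, 0.7188),
        ('Logistic Regression', 'default', 0.8633, 0.6000, 0.2278, 0.3303),
        ('Logistic Regression', 'tuned', 0.8596, 0.5625, 0.2278, 0.3243),
        ('Random Forest', 'default', 0.9513, 1.0000, 0.6709, 0.8030),
        ('Random Forest', 'tuned', 0.9494, 0.9815, 0.6709, 0.7970),
    ],
    columns=['model', 'variant', 'accuracy', 'precision', 'recall', 'f1'],
)

# Rank by F1, which balances precision and recall on the churn class.
ranked = results.sort_values('f1', ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

On these figures the random forest ranks first on every metric, and logistic regression last. The gap between the default and tuned random forest is about 0.002 in accuracy, roughly one test row if the test set has 534 rows as the recorded fractions suggest, so it may simply reflect the unseeded randomness of the forest.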
