
# Optimization Project: Building and Tuning a Final Model

**Applying All Learned Tuning and Optimization Techniques**
- Comprehensive Model Optimization

  - Data Preprocessing

    - Ensure data is clean, scaled, and encoded appropriately

  - Feature Engineering

    - Derive new features and select the most important ones

  - Regularization

    - Avoid overfitting by penalizing complex models

  - Cross-Validation

    - Use techniques like K-Fold or Stratified K-Fold for robust performance metrics

  - Hyperparameter Tuning

    - Use methods like GridSearchCV, RandomizedSearchCV, or Bayesian Optimization
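
The techniques listed above can be combined in a single scikit-learn workflow. The following is a minimal sketch on synthetic data, not the project's own code: the dataset from `make_classification`, the search range for `C`, and all variable names are illustrative assumptions. Scaling sits inside a `Pipeline` so it is refit on each training fold, `C` sets the strength of the default L2 regularization in `LogisticRegression`, and `RandomizedSearchCV` tunes it under Stratified K-Fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced binary data (illustrative assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)

# Preprocessing inside the pipeline is fit only on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),  # L2 penalty by default
])

# Smaller C means stronger regularization.
param_distributions = {"clf__C": np.logspace(-3, 2, 20)}

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=8, cv=cv,
    scoring="roc_auc", random_state=0,
)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["clf__C"])
print(f"Mean CV ROC-AUC: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means each validation fold is scored with statistics learned from its training folds only, which is the main reason to prefer this layout over scaling the full dataset up front.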

**Evaluating and Interpreting Model Performance**
- Performance Metrics
  - Classification
     - Accuracy, Precision, Recall, F1-Score, ROC-AUC

  - Regression
     - Mean Squared Error (MSE), Mean Absolute Error (MAE), R^2

- Importance of Interpretability
  - Use feature importance and coefficient analysis for transparency
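
As a small worked example of the metrics above, the sketch below computes each one with `sklearn.metrics` on tiny hand-made arrays. The labels, scores, and regression values are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    mean_squared_error, mean_absolute_error, r2_score,
)

# Classification: hypothetical true labels and predicted positive-class scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]  # threshold at 0.5 -> TP=3, TN=3, FP=1, FN=1

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 6 / 8
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)     # computed from scores, not thresholded labels

print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f} ROC-AUC={auc:.3f}")

# Regression: hypothetical targets and predictions.
y_true_reg = [3.0, 5.0, 2.0, 7.0]
y_pred_reg = [2.5, 5.0, 3.0, 8.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)   # mean of squared errors
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean of absolute errors
r2 = r2_score(y_true_reg, y_pred_reg)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.4f} MAE={mae:.4f} R^2={r2:.4f}")
```

Note that ROC-AUC takes the continuous scores rather than the thresholded predictions, so it can differ from accuracy even when both are computed on the same model output.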

**Objective**

- Build, tune, and optimize a machine learning model using a structured process, and evaluate its performance comprehensively

In [1]:
import pandas as pd
from google.colab import files
uploaded = files.upload()

Saving churns.txt to churns.txt


In [4]:
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load Dataset
df = pd.read_csv("churns.txt")

# Display dataset info
print("Dataset info: \n")
print(df.info())
print("Class Distribution: \n")
print(df["Churn"].value_counts())
print("Sample data: \n", df.head())

# Handle missing values: non-numeric TotalCharges entries become NaN, then those rows are dropped
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df.dropna(inplace=True)

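# Encode categorical features with a fresh encoder per column
# (customerID is an identifier and is dropped later, so it is skipped)
for column in df.select_dtypes(include=["object"]).columns:
  if column not in ("Churn", "customerID"):
    df[column] = LabelEncoder().fit_transform(df[column])

# Encode target variable (classes are assigned integers in sorted order)
label_encoder = LabelEncoder()
df["Churn"] = label_encoder.fit_transform(df["Churn"])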

# Define features (X) and target (y)
X = df.drop(["Churn", "customerID"], axis=1)
y = df["Churn"]

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# Scale numerical features: fit on the training set only so test statistics do not leak
scaler = StandardScaler()
numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Train initial model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Evaluate initial model
y_pred = rf_model.predict(X_test)
accuracy_initial = accuracy_score(y_test, y_pred)

print(f"Initial Model Accuracy: {accuracy_initial:.4f}")
print("Classification Report: \n", classification_report(y_test, y_pred))

# Define parameter distributions to sample from
param_dist = {
    "n_estimators": np.arange(50, 200, 10),
    "max_depth":[None, 5, 10, 15],
    "min_samples_split":[2, 5, 10, 20],
    "min_samples_leaf":[1, 2, 4]
}

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    random_state=42
)

# Perform Randomized Search
random_search.fit(X_train, y_train)

# Get best parameters
best_params = random_search.best_params_
print("Best Parameters: \n", best_params)

# Retrieve the best model (already refit on the full training set by RandomizedSearchCV)
best_model = random_search.best_estimator_

# Predict and Evaluate
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)

print(f"Best Model Accuracy: {accuracy_best:.4f}")
print("Classification Report: \n", classification_report(y_test, y_pred_best))

Dataset info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non