<a href="https://colab.research.google.com/github/Sugam1530/Productionization-of-ML-Systems/blob/main/Productionization_of_ML_Systems_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA/Classification
##### **Contribution**    - Individual


# **Project Summary -**

In this project, we tackled the problem of gender classification using a dataset comprising user information from the travel and tourism industry. The primary objective was to build a machine learning model that accurately predicts a user's gender based on features such as age and company affiliation. This project involved several crucial steps, including data preprocessing, model training, validation, and deployment via a Flask API.

We began by loading the dataset, which included user attributes such as company, name, gender, and age. The dataset was then preprocessed to handle categorical and numerical features. The company feature, being categorical, was transformed using one-hot encoding, which converted the categorical values into a format suitable for machine learning algorithms. The age feature, a numerical attribute, was standardized to zero mean and unit variance using StandardScaler. Tree-based models such as random forests do not require scaled inputs, but standardizing keeps the preprocessing reusable for scale-sensitive models.

After preprocessing, the dataset was split into training and test sets to evaluate the model's performance effectively. The target variable, gender, was encoded using LabelEncoder to transform the categorical labels into numerical values that the machine learning model could process.

We selected a RandomForestClassifier for this task because it handles mixed feature types with little tuning. The model was trained on the training set, and its performance was evaluated on the test set using accuracy and a classification report. Test accuracy was about 0.30 on 268 held-out users across three gender classes (training accuracy was about 0.55), which is no better than chance. Age and company affiliation alone therefore do not carry enough signal to predict gender in this dataset.

Once the model was trained and validated, we proceeded to save the model, along with the encoders and scaler, using the joblib library. This step was crucial for deploying the model in a production environment, as it allowed us to load the trained model and necessary preprocessing tools without retraining.

For deployment, we developed a Flask API to serve the model predictions. The Flask application included endpoints to receive user data, preprocess the input using the saved encoders and scaler, and generate predictions using the trained model. To make the API accessible over the internet, we used ngrok, which provided a secure tunnel to localhost, exposing the Flask application to the web.

Users can interact with the API by sending a POST request with their age and company information. The API processes this input, applies the necessary transformations, and returns the predicted gender as an encoded class index (the integer assigned by LabelEncoder). This setup allows the model to be integrated into other applications and queried in real time.

# **GitHub Link -**

https://github.com/Sugam1530/Productionization-of-ML-Systems

# **Problem Statement**


**The aim of this project was to develop a machine learning model capable of predicting a user's gender based on specific attributes such as age and company affiliation within the travel and tourism industry. This gender classification model can be used to personalize services, enhance user experience, and improve marketing strategies by understanding user demographics better.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following:
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. For each algorithm, address the following:


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation and hyperparameter tuning.

*   Have you seen any improvement? Note the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [1]:
# Import Libraries
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from flask import Flask, request, jsonify
from google.colab import drive
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Dataset Loading

In [2]:
# Load Dataset
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
file_path = '/content/drive/MyDrive/Colab Notebooks/travel_capstone/users.csv'

In [4]:
users = pd.read_csv(file_path)

### Dataset First View

In [5]:
users.head()

Unnamed: 0,code,company,name,gender,age
0,0,4You,Roy Braun,male,21
1,1,4You,Joseph Holsten,male,37
2,2,4You,Wilma Mcinnis,female,48
3,3,4You,Paula Daniel,female,23
4,4,4You,Patricia Carson,female,44


In [26]:
users.tail()

Unnamed: 0,code,company,name,gender,age
1335,1335,Umbrella LTDA,Albert Garroutte,male,23
1336,1336,Umbrella LTDA,Kim Shores,female,40
1337,1337,Umbrella LTDA,James Gimenez,male,28
1338,1338,Umbrella LTDA,Viola Agosta,female,52
1339,1339,Umbrella LTDA,Paul Rodriguez,male,35


### Dataset Rows & Columns count

In [6]:
users.shape

(1340, 5)

### Dataset Information

In [7]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1340 entries, 0 to 1339
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   code     1340 non-null   int64 
 1   company  1340 non-null   object
 2   name     1340 non-null   object
 3   gender   1340 non-null   object
 4   age      1340 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 52.5+ KB


#### Duplicate Values

In [8]:
users.duplicated().sum()

0

#### Missing Values/Null Values

In [9]:
users.isnull().sum()

Unnamed: 0,0
code,0
company,0
name,0
gender,0
age,0


### What did you know about your dataset?

The users dataset has 1,340 rows and 5 columns (code, company, name, gender, age). It contains no null values and no duplicate rows, so it needs no cleaning before modeling.

## ***2. Understanding Your Variables***

In [10]:
users.columns

Index(['code', 'company', 'name', 'gender', 'age'], dtype='object')

In [11]:
users.describe()

Unnamed: 0,code,age
count,1340.0,1340.0
mean,669.5,42.742537
std,386.968991,12.869779
min,0.0,21.0
25%,334.75,32.0
50%,669.5,42.0
75%,1004.25,54.0
max,1339.0,65.0


### Check Unique Values for each variable.

In [12]:
users.nunique()

Unnamed: 0,0
code,1340
company,5
name,1338
gender,3
age,45
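
`users.nunique()` reports three distinct values in `gender`, so the target is a three-class problem rather than a binary one. A minimal sketch for checking the class balance follows. It uses only the five rows printed by `users.head()` as a stand-in (`users_sample` is a hypothetical name), so the full dataset will give different counts and will include the third label.

```python
import pandas as pd

# Stand-in for the full `users` DataFrame: the five rows printed by users.head()
users_sample = pd.DataFrame({
    "company": ["4You"] * 5,
    "gender": ["male", "male", "female", "female", "female"],
    "age": [21, 37, 48, 23, 44],
})

# Absolute and relative frequency of each target class
gender_counts = users_sample["gender"].value_counts()
gender_share = users_sample["gender"].value_counts(normalize=True)

print(gender_counts)
print(gender_share.round(2))
```

Running the same two `value_counts` calls on `users["gender"]` would show whether the three classes are balanced, which matters when reading the accuracy figures later in the notebook.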


In [13]:
# Handling categorical variables
company_encoder = OneHotEncoder()
company_features = company_encoder.fit_transform(users[['company']])

In [14]:
# Target variable (gender)
y = users['gender']

# Encode target variable
gender_encoder = LabelEncoder()
y_encoded = gender_encoder.fit_transform(y)

In [15]:
# Scaling numerical features
scaler = StandardScaler()
age_feature = scaler.fit_transform(users[['age']])

In [16]:
users['company'].unique()

array(['4You', 'Monsters CYA', 'Wonka Company', 'Acme Factory',
       'Umbrella LTDA'], dtype=object)
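
The encoder and scaler above are fitted separately and later saved as separate files. One possible alternative, sketched here under assumptions, is to bundle them with the classifier in a single scikit-learn `Pipeline` so that one object carries all preprocessing. The six rows are taken from the `head()` and `tail()` outputs above; `users_sample`, `pipeline`, and the `"Unseen Co"` company are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Six rows copied from the head()/tail() outputs, used only as a stand-in
users_sample = pd.DataFrame({
    "company": ["4You", "4You", "4You",
                "Umbrella LTDA", "Umbrella LTDA", "Umbrella LTDA"],
    "age": [21, 37, 48, 23, 40, 28],
    "gender": ["male", "male", "female", "male", "female", "male"],
})

# One-hot encode company and standardize age in a single transformer
preprocess = ColumnTransformer([
    ("company", OneHotEncoder(handle_unknown="ignore"), ["company"]),
    ("age", StandardScaler(), ["age"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])
pipeline.fit(users_sample[["company", "age"]], users_sample["gender"])

# handle_unknown="ignore" encodes an unseen company as all zeros instead of failing
new_user = pd.DataFrame({"company": ["Unseen Co"], "age": [30]})
predicted = pipeline.predict(new_user)
print(predicted)
```

With this layout a single `joblib.dump(pipeline, ...)` would persist the preprocessing and the model together, at the cost of a different structure from the one the rest of this notebook uses.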

# **Model training and the Flask API**

In [17]:
!pip install flask
!pip install pyngrok

Collecting pyngrok
  Downloading pyngrok-7.2.0-py3-none-any.whl.metadata (7.4 kB)
Downloading pyngrok-7.2.0-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok
Successfully installed pyngrok-7.2.0


In [18]:
# Combine features
X = np.concatenate([company_features.toarray(), age_feature], axis=1)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# 3. Model Training
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train a classification model
classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

# Predict on test data
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(report)


Accuracy: 0.29850746268656714
              precision    recall  f1-score   support

           0       0.34      0.35      0.34        89
           1       0.28      0.23      0.26        94
           2       0.27      0.32      0.29        85

    accuracy                           0.30       268
   macro avg       0.30      0.30      0.30       268
weighted avg       0.30      0.30      0.30       268
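
To put the reported accuracy in context, it helps to compare it with two trivial baselines: guessing uniformly among the three classes, and always predicting the most frequent test class. The sketch below uses only the support counts and accuracy printed in the report above; the variable names are illustrative.

```python
# Support per encoded class and accuracy, copied from the classification report
support = {0: 89, 1: 94, 2: 85}
reported_accuracy = 0.29850746268656714

total = sum(support.values())

# Expected accuracy when guessing uniformly at random among the classes
uniform_baseline = 1 / len(support)

# Accuracy of always predicting the most frequent class in the test set
majority_baseline = max(support.values()) / total

print(f"Test samples:      {total}")
print(f"Uniform baseline:  {uniform_baseline:.3f}")
print(f"Majority baseline: {majority_baseline:.3f}")
print(f"Model accuracy:    {reported_accuracy:.3f}")
```

The model's accuracy falls below both baselines, which indicates that it has not learned a usable relationship between the features and the target.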



In [19]:
# Confusion matrix for the test-set predictions

from sklearn.metrics import confusion_matrix

# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Print confusion matrix
print(cm)


[[31 27 31]
 [31 22 41]
 [29 29 27]]
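
The per-class precision and recall in the classification report can be recovered directly from this matrix, where rows are true classes and columns are predicted classes. A short sketch using the printed values (variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [31, 27, 31],
    [31, 22, 41],
    [29, 29, 27],
])

correct = np.diag(cm)

# Recall: correct predictions divided by the number of true samples per class
recall = correct / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision = correct / cm.sum(axis=0)

# Accuracy: all correct predictions divided by all samples
accuracy = correct.sum() / cm.sum()

print("Recall:   ", np.round(recall, 2))
print("Precision:", np.round(precision, 2))
print("Accuracy: ", round(float(accuracy), 4))
```

The off-diagonal counts are about as large as the diagonal ones, so the errors are spread evenly across classes rather than concentrated in one confusion.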


In [20]:
# Training accuracy, for comparison with the test accuracy above
y_pred_train = classifier.predict(X_train)
accuracy_train = accuracy_score(y_train, y_pred_train)
print(f'Accuracy on training data: {accuracy_train}')

Accuracy on training data: 0.5485074626865671
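
A single train/test split gives one noisy estimate, and the gap between training accuracy (about 0.55) and test accuracy (about 0.30) suggests the forest is memorizing training rows. Stratified k-fold cross-validation is one common way to get a steadier estimate. The sketch below runs on synthetic random data (`X_demo`, `y_demo` are stand-ins, not the notebook's features), so its scores say nothing about the real dataset; on the notebook's data one would pass `X` and `y_encoded` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 150 samples, 6 features, 3 classes with no real signal
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(150, 6))
y_demo = rng.integers(0, 3, size=150)

# Five folds, each preserving the class proportions of the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

print("Fold accuracies:", np.round(scores, 3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```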


In [21]:
# 5. Saving the Trained Model
import joblib
joblib.dump(classifier, 'gender_classification_model.pkl')

['gender_classification_model.pkl']

In [22]:
# 6. Saving the Encoder and Scaler
joblib.dump(company_encoder, 'company_encoder.pkl')
joblib.dump(scaler, 'age_scaler.pkl')
joblib.dump(gender_encoder, 'gender_encoder.pkl')

['gender_encoder.pkl']

In [23]:
from flask import Flask, request, jsonify
import joblib
import pandas as pd
from pyngrok import ngrok
from threading import Thread

# Load the trained model, scaler, and encoder
model = joblib.load('gender_classification_model.pkl')
scaler = joblib.load('age_scaler.pkl')
company_encoder = joblib.load('company_encoder.pkl')

# Define the expected columns based on the training data
expected_columns = ['company_4You', 'company_Acme Factory', 'company_Monsters CYA', 'company_Umbrella LTDA', 'company_Wonka Company', 'age']

app = Flask(__name__)

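@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Parse the JSON body, e.g. {"age": [23], "company": ["4You"]}
        data = request.get_json(force=True)
        df = pd.DataFrame(data)

        # One-hot encode the company with the encoder fitted during training
        company_features = company_encoder.transform(df[['company']]).toarray()
        company_df = pd.DataFrame(company_features, columns=company_encoder.get_feature_names_out(['company']))

        # Scale age with the scaler fitted during training
        age_array = df['age'].values.reshape(-1, 1)
        age_scaled = scaler.transform(age_array)
        age_df = pd.DataFrame(age_scaled, columns=['age'])

        # Assemble the features in the column order used for training
        df_processed = pd.concat([company_df, age_df], axis=1)
        for col in expected_columns:
            if col not in df_processed.columns:
                df_processed[col] = 0
        df_processed = df_processed[expected_columns]

        # Return the encoded class index predicted for the first record
        prediction = model.predict(df_processed)
        return jsonify({'prediction': int(prediction[0])})
    except Exception as e:
        # Report failures (bad payload, unseen company, ...) with an error status
        return jsonify({'error': str(e)}), 400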

def run_flask():
    app.run(port=5000)

# Start Flask app in a separate thread
flask_thread = Thread(target=run_flask)
flask_thread.start()

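# Set up ngrok. The auth token is read from an environment variable instead of
# being hard-coded, so the secret is not stored in the notebook.
import os
NGROK_AUTH_TOKEN = os.environ["NGROK_AUTH_TOKEN"]
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
public_url = ngrok.connect(5000)
print('Public URL:', public_url)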


Downloading ngrok ... * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


Public URL: NgrokTunnel: "https://a849-34-16-135-105.ngrok-free.app" -> "http://localhost:5000"
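
The endpoint returns the encoded class index rather than a readable label. The saved `gender_encoder.pkl` could map that index back with `LabelEncoder.inverse_transform`. The sketch below illustrates the idea with a stand-in encoder fitted on only the two labels visible in `users.head()`; the real encoder holds three classes, so its index-to-label mapping may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in encoder: the notebook's gender_encoder was fitted on three labels,
# only two of which ("male", "female") appear in the outputs shown above
demo_encoder = LabelEncoder()
demo_encoder.fit(["male", "female"])

# LabelEncoder sorts classes alphabetically, so here female -> 0 and male -> 1
encoded_prediction = 1
decoded_label = demo_encoder.inverse_transform([encoded_prediction])[0]

print(list(demo_encoder.classes_))
print(decoded_label)
```

In the API, the equivalent step would be to load the encoder with `joblib.load('gender_encoder.pkl')` next to the model and scaler, then return the decoded label in the JSON response.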


# **Use the cURL command below to test the API. Copy it and replace the URL with the newly generated ngrok link.**

In [24]:
# curl --location 'https://b48d-34-90-148-107.ngrok-free.app/predict' \
# --header 'Content-Type: application/json' \
# --data '{
#     "age": [23],
#     "company": ["4You"]
# }'
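
The same request can also be built from Python. The sketch below uses only the standard library and constructs the request without sending it. `build_predict_request` is a hypothetical helper and the URL is a placeholder to be replaced with the public URL printed by ngrok.

```python
import json
import urllib.request


def build_predict_request(base_url, age, company):
    """Build (but do not send) a POST request for the /predict endpoint."""
    payload = {"age": [age], "company": [company]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace it with the public URL printed by ngrok
req = build_predict_request("https://example.ngrok-free.app", 23, "4You")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To call the running API, send the request and parse the JSON reply:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```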

# **Conclusion**

This project demonstrated the end-to-end process of building, evaluating, and deploying a gender classification model. The categorical and numerical features were encoded and scaled, a RandomForestClassifier was trained, and the model was served through a Flask API exposed via ngrok. The model itself is not accurate: test accuracy was about 0.30 across three classes (training accuracy was about 0.55), which is at chance level, so age and company affiliation alone are not sufficient to predict gender in this dataset.

The main value of the project lies in the deployment workflow: persisting the model, encoders, and scaler with joblib and serving predictions as a web service. Before the API is used for personalization or user engagement in the travel and tourism industry, the model would need more informative features and a fresh evaluation. The same workflow applies to other classification problems once a model with real predictive power is available.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***