## Assignment 1: Customer's Shopping Data

The goal of this assignment is to develop machine learning models that predict the spending behavior of a shop's customers from their characteristics. These predictions can help the shop target customers with strategies tailored to their spending behavior, and thus increase its profits. To that end, you should load the customer dataset attached to this assignment, develop linear and logistic regression models for spending prediction, and finally evaluate the performance of your trained models on test data.

*Note: Please include comments in your code so it can be easily followed and understood.*

### Loading the Data

The customer shopping dataset is split into training and testing data. The training data will be used to build the linear and logistic regression models, and the test data will be used to evaluate how well they predict customer spending.


Importing the necessary Python libraries to load the data.

In [32]:
#import libraries
import pandas as pd

In [33]:
#Load the train_data.csv file in python, and assign it to a variable named "train".
train = pd.read_csv('train_data.csv')

In [34]:
#Load the test_data.csv file in python, and assign it to a variable named "test".
test = pd.read_csv('test_data.csv')
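
Before modeling, it can help to confirm that both files loaded with the expected columns and types. The sketch below is only an illustration: it uses a small made-up DataFrame (the values are hypothetical and not taken from train_data.csv) with some of the column names used later in this notebook, and the helper name `summarize` is an assumption introduced here.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [21, 35, 48],
    "Annual Income": [15000, 62000, 98000],
    "Profession": ["Artist", "Engineer", None],
    "Spending Score": [39, 81, 6],
})

def summarize(df):
    """Return the shape, per-column dtypes, and per-column missing-value counts."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }

summary = summarize(toy)
print(summary["shape"])    # (rows, columns)
print(summary["missing"])  # missing values per column
```

Calling `summarize(train)` and `summarize(test)` in the same way would show whether any column contains missing values that the models below cannot handle directly.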

### Linear Regression

In this section, you will train a linear regression model to predict the spending score of the customers in the shop. You should use the “Spending Score” column as the target variable and all the remaining columns as the independent variables (i.e., features).

*Note: Make sure to exclude the “Spending Category” column in this section of the assignment, given that it is derived from the “Spending Score” column (i.e., to avoid data leakage).*



Importing the necessary Python libraries to call the linear regression model and the MSE metric.

In [35]:
#import libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Dropping the "Spending Category" column from the training and testing data.

In [36]:
#Drop Spending Category column
linear_reg_train = train.drop(columns="Spending Category")
linear_reg_test = test.drop(columns="Spending Category")

In the remaining part of this section, you should use the linear_reg_train and linear_reg_test dataframes instead of the train and test dataframes.

In [37]:
#Train a linear regression model to predict the spending score of a customer
#You should use default values for all parameters

# Import libraries to be able to use categorical columns as well
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Extract features (X) and target variable (y) from the training data
X_train = linear_reg_train[['Gender', 'Age', 'Annual Income', 'Profession', 'Work Experience', 'Family Size']]
Y_train = linear_reg_train['Spending Score']

# Extract features (X) and target variable (y) from the testing data
X_test = linear_reg_test[['Gender', 'Age', 'Annual Income', 'Profession', 'Work Experience', 'Family Size']]
Y_test = linear_reg_test['Spending Score']


# Create a column transformer to handle categorical variables
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['Gender', 'Profession'])
    ],
    remainder='passthrough'
)

# Initialize the Linear Regression model
model = LinearRegression()

# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', model)
])

# Train the model on the training data
pipeline.fit(X_train, Y_train)

# Make predictions on the test data
Y_pred = pipeline.predict(X_test)


In [38]:
#Compute the MSE metric to evaluate the trained model using the test data
mse = mean_squared_error(Y_test, Y_pred)

# Print the MSE
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 780.7824887124127
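
The MSE is expressed in squared units of the target, which makes it hard to read directly. Taking the square root gives the RMSE, which is in the same units as the Spending Score. The sketch below does this for the value reported above, and also includes a hypothetical helper, `mean_baseline_mse` (a name introduced here for illustration), that computes the MSE of always predicting the training mean; the toy numbers passed to it are made up.

```python
import math
import numpy as np

# MSE reported by the cell above.
reported_mse = 780.7824887124127
rmse = math.sqrt(reported_mse)
print(f"RMSE: {rmse:.2f}")  # RMSE: 27.94

def mean_baseline_mse(y_train, y_test):
    """MSE of a trivial model that always predicts the mean of y_train."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    return float(np.mean((y_test - y_train.mean()) ** 2))

# Made-up numbers, only to show how the helper is called.
toy_baseline = mean_baseline_mse([10, 50, 90], [40, 60])
print(toy_baseline)  # 100.0
```

Running `mean_baseline_mse(Y_train, Y_test)` on the real data and comparing the result with the model's MSE would indicate how much the features actually contribute beyond predicting the average score.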


### Logistic Regression

In this section, you will train a logistic regression model to predict the spending category of a customer in the shop, as either High or Low. The target variable will be the “Spending Category” column, and all the remaining columns will be the independent variables.

*Note: Don’t use the “Spending Score” column in this section of the assignment, given that the “Spending Category” column is derived from it.*




Importing the necessary Python libraries to call the logistic regression model and the accuracy metric.



In [39]:
#import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Dropping the "Spending Score" column from the training and testing data.

In [40]:
#Drop Spending Score column
logistic_reg_train = train.drop(columns="Spending Score")
logistic_reg_test = test.drop(columns="Spending Score")

In the remaining part of this section, you should use the logistic_reg_train and logistic_reg_test dataframes instead of the train and test dataframes.

In [41]:
#Train a logistic regression model to predict the spending category of a customer
#You should use default values for all parameters

# Import libraries to be able to use categorical columns as well
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Extract features (X) and target variable (y) from the training data
X_train = logistic_reg_train[['Gender', 'Age', 'Annual Income', 'Profession', 'Work Experience', 'Family Size']]
Y_train = logistic_reg_train['Spending Category']

# Extract features (X) and target variable (y) from the testing data
X_test = logistic_reg_test[['Gender', 'Age', 'Annual Income', 'Profession', 'Work Experience', 'Family Size']]
Y_test = logistic_reg_test['Spending Category']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['Gender', 'Profession'])
    ],
    remainder='passthrough'
)

#Initialize the Logistic Regression model
model = LogisticRegression()

# Create a pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', model)
])

# Train the model on the training data
pipeline.fit(X_train, Y_train)

# Make predictions on the test data
Y_pred = pipeline.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
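
The warning above points to two remedies: raising `max_iter` or scaling the features. Since the assignment asks for default model parameters, one option that leaves `LogisticRegression()` untouched is to standardize the numeric columns in the preprocessing step. The following is a minimal sketch of that idea on synthetic data; the data, the labeling rule, and the names `scaled_preprocessor` and `scaled_pipeline` are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with features on very different scales (made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 80, size=n),
    "Annual Income": rng.integers(10_000, 200_000, size=n),
})
# Hypothetical labeling rule, used only to have a target to fit.
labels = np.where(toy["Annual Income"] > 100_000, "High", "Low")

# One-hot encode the categorical column and standardize the numeric ones.
scaled_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), ["Gender"]),
        ("num", StandardScaler(), ["Age", "Annual Income"]),
    ]
)

scaled_pipeline = Pipeline([
    ("preprocessor", scaled_preprocessor),
    ("classifier", LogisticRegression()),  # default parameters
])
scaled_pipeline.fit(toy, labels)

# Number of solver iterations actually used (the default limit is 100).
n_iter = int(scaled_pipeline.named_steps["classifier"].n_iter_[0])
print(f"lbfgs iterations used: {n_iter}")
```

With unscaled inputs such as raw income values, the default solver often needs many more iterations, which is the likely cause of the warning shown above. Whether scaling also improves accuracy on the real data would need to be checked by rerunning the evaluation.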


In [42]:
#Compute the accuracy metric to evaluate the trained model on the test data
accuracy = accuracy_score(Y_test , Y_pred)

#Print the accuracy metric
print(f"Accuracy: {accuracy}")


Accuracy: 0.5241730279898219
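
An accuracy of about 0.52 is easier to interpret next to a trivial baseline, such as always predicting the most frequent class in the training data. The sketch below shows one way to compute that baseline; `majority_baseline_accuracy` is a hypothetical helper introduced here, and the toy labels are made up.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == most_common_label)
    return hits / len(y_test)

# Made-up labels, only to show how the helper is called.
toy_train = ["High", "Low", "Low", "Low"]
toy_test = ["Low", "High", "Low", "High"]
toy_accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(toy_accuracy)  # 0.5
```

Calling `majority_baseline_accuracy(list(Y_train), list(Y_test))` on the real labels would show whether the logistic regression model does better than this baseline.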
