Overview:

For my analysis, I chose the "Sales Records" data sets: one small data set (5,000 sales records) and one large data set (100,000 sales records). Machine learning is a powerful tool for improving sales. It can help us understand and predict customer behavior, and it can support better, more robust decisions aimed at generating more revenue. Both the large and small data sets have 14 variables. Below is a description of the variables:

1). Region:  Continent where the sales occur (Asia, Europe....) </br>
2). Country: Country where the sales occur</br>
3). Item Type: The type of product being sold (clothing, snacks....)</br>
4). Sales channel: How the sale was made (in person or online)</br>
5). Order Priority: The priority of the order (C - Critical, H - High, M - Medium and L - Low)</br>
6). Order Date: The date the order was placed</br>
7). Order ID: Unique identifier for an order</br>
8). Ship Date: The date the particular product has been shipped</br>
9). Units Sold: The quantity of a particular product that was sold</br>
10).Unit Price: The price at which each unit of the product is sold</br>
11).Unit Cost:  The cost incurred by the seller for each unit sold</br>
12).Total Revenue: The total revenue generated by a particular sale</br>
13).Total Cost: The total cost incurred by the seller for a particular transaction (shipping costs, etc....)</br>
14).Total Profit: The total profit generated by the sale</br>

Based on analyzing the content and structure of the two data sets, I believe that the two best suited ML algorithms for the analysis of these two data sets are: Decision Trees and Multinomial Logistic Regression. 

1). Data Exploration:



In [None]:

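import numpy as np
import pandas as pd
import seaborn as sea
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error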


#Load both the large and small data set from GitHub
large_sales_records = pd.read_csv("https://raw.githubusercontent.com/Data-Vlad/Data-Science/main/Data%20622%20-%20Machine%20Learning%20and%20Big%20Data/Project%20%23%201/100000%20Sales%20Records.csv")
small_sales_records = pd.read_csv("https://raw.githubusercontent.com/Data-Vlad/Data-Science/main/Data%20622%20-%20Machine%20Learning%20and%20Big%20Data/Project%20%23%201/5000%20Sales%20Records.csv")

#Print the first 10 records of each data set
print(large_sales_records.head(10))
print(small_sales_records.head(10))

#Print the meta data (the column names and data types) of each data set
# info() prints its report directly and returns None, so it is not wrapped in print()
large_sales_records.info()
small_sales_records.info()

#print the summary statistics for each data set
print(large_sales_records.describe())
print(small_sales_records.describe())

# I will now perform exploratory data analysis (EDA) by creating visualizations for both the continuous and categorical features to better understand
# the relationships/collinearity in both data sets.

#I will now create a grid of histograms to measure the distribution for all of the numeric values in each data set
numeric_eda_large = large_sales_records[['Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']]
fig=plt.figure(figsize=[20,10])
ax = fig.gca()
numeric_eda_large.hist(ax = ax)
plt.subplots_adjust(hspace=0.5)
plt.show()
numeric_eda_small = small_sales_records[['Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']]
fig=plt.figure(figsize=[20,10])
ax = fig.gca()
numeric_eda_small.hist(ax = ax)
plt.subplots_adjust(hspace=0.5)
plt.show()

# I will now create a correlation heatmap/matrix to visualize the relationships between the numeric variables for each data set.
numeric_eda_small = small_sales_records[['Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']]
d = pd.DataFrame(data=numeric_eda_small, columns=list(numeric_eda_small.columns))
corr = d.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sea.diverging_palette(230, 20, as_cmap=True)
sea.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
numeric_eda_large = large_sales_records[['Units Sold', 'Unit Price', 'Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']]
# Use the large data set here (the small one was plotted above)
d = pd.DataFrame(data=numeric_eda_large, columns=list(numeric_eda_large.columns))
corr = d.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sea.diverging_palette(230, 20, as_cmap=True)
sea.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

#I will now create box plots to show the relationship between numeric and categorical variables. The first box plot shows the relationship
# between "Order Priority" and "Unit Cost". The second plot shows the relationship between "Order Priority" and "Unit Price". The third box plot shows
# the relationship between the "Region" and "Total Profit" variables. The fourth box plot shows the relationship between the "Region" and "Total Cost" fields.
# plt.show() is called after each plot so that the plots are not drawn on top of one another.
sea.boxplot(x='Order Priority', y='Unit Cost', data=large_sales_records)
plt.show()
sea.boxplot(x='Order Priority', y='Unit Cost', data=small_sales_records)
plt.show()
sea.boxplot(x='Order Priority', y='Unit Price', data=large_sales_records)
plt.show()
sea.boxplot(x='Order Priority', y='Unit Price', data=small_sales_records)
plt.show()
ax = sea.boxplot( x='Region', y='Total Profit',data=large_sales_records)
ax.tick_params(axis='x', labelrotation=30)
plt.show()
ax = sea.boxplot( x='Region', y='Total Cost',data=small_sales_records)
ax.tick_params(axis='x', labelrotation=30)
plt.show()

#I will now create a bar plot for each data set comparing the Units sold of a product for each Sales Channel (online or offline) by Order Priority. I
# will also create a bar plot for each data set comparing Units Sold for each Item Type by Order Priority.
ax = sea.barplot(data=large_sales_records, x='Sales Channel', y='Units Sold', hue='Order Priority',  errwidth=0)
for i in ax.containers:
    ax.bar_label(i,)
plt.title("Units Sold by Sales Channel and Order Priority")
plt.show()
ax = sea.barplot(data=small_sales_records, x='Sales Channel', y='Units Sold', hue='Order Priority',  errwidth=0)
for i in ax.containers:
    ax.bar_label(i,)
# Set the title before showing the figure, otherwise it is applied to the next (empty) figure
plt.title("Units Sold by Sales Channel and Order Priority")
plt.show()
ax = sea.barplot(data=large_sales_records, x='Item Type', y='Units Sold',  hue='Order Priority',  errwidth=0)
plt.title("Units Sold by Item Type and Order Priority")
plt.show()
ax = sea.barplot(data=small_sales_records, x='Item Type', y='Units Sold',  hue='Order Priority',  errwidth=0)
plt.title("Units Sold by Item Type and Order Priority")
plt.show()


Exploratory Data Analysis (EDA):


Feature and meta data -
After reviewing the meta data for each data set, I saw that the "Order Date" and "Ship Date" fields in both data sets were loaded with the wrong data type. For the purposes of analysis and model building they need to be converted to a "date" data type. I also noticed that all of the categorical variables ("Region", "Country", "Item Type", "Sales Channel" and "Order Priority") have "string" data types. For the purposes of our analysis they need to be converted to the "category" data type. I would also propose dropping the "Country" column. The reason is that we want to reduce the number of unique values in a column and work with broader groups such as those in the "Region" column. The "Country" column would not be of much use, especially since we already have the "Region" groups that the countries belong to. I would also drop the "Order ID" column because it is of no importance to our analysis and model building.
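The type conversions proposed above can be sketched on a small made-up frame. The rows, the `M/D/YYYY` date format and the name `sample` are assumptions for illustration only; they are not taken from the data sets.

```python
import pandas as pd

# Hypothetical rows; the real data sets have 14 columns.
sample = pd.DataFrame({
    "Region": ["Europe", "Asia", "Europe"],
    "Country": ["France", "Japan", "Spain"],
    "Order Date": ["5/28/2010", "8/22/2012", "5/2/2014"],
})

# Parse the date strings (the format is an assumption about the CSV files).
sample["Order Date"] = pd.to_datetime(sample["Order Date"], format="%m/%d/%Y")

# Convert the string columns to the "category" data type.
sample = sample.astype({"Region": "category", "Country": "category"})

# "Country" has more unique values than "Region", which is the motivation for dropping it.
unique_counts = sample[["Region", "Country"]].nunique()
print(sample.dtypes)
print(unique_counts)
```

Comparing `nunique()` across the categorical columns is a quick way to see which ones are too fine-grained to be useful as groups.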

Histogram -
I have also created a grid of histograms for all of the numeric variables in each data set. The reason for doing this is to analyze the frequency distribution of each variable. For both data sets, the "Units Sold" variable is distributed uniformly, and most products have a "Unit Price" in the range of $0 to $300. The "Unit Cost" variable is in the range of $0 to $150. It is also worth mentioning that the "Total Revenue", "Total Cost" and "Total Profit" variables are right-skewed (the mean is greater than the mode). The long right tail of these distributions contains extreme values (outliers), which can have a bad effect on a model's performance. One way to normalize right-skewed data is a log transformation, which compresses the tail and reduces the influence of the outliers. What this tells me is that I should probably exclude the "Total Revenue", "Total Cost" and "Total Profit" fields from my model creation.
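The effect of a log transformation on a right-skewed variable can be sketched as follows. The values are synthetic (drawn from a log-normal distribution) and only stand in for a column such as "Total Revenue"; they are an assumption, not the actual sales figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for "Total Revenue".
revenue = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=5000), name="Total Revenue")

# log1p computes log(1 + x), which is also safe when a value is 0.
log_revenue = np.log1p(revenue)

# Skewness near 0 means a roughly symmetric distribution.
skew_before = revenue.skew()
skew_after = log_revenue.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A strongly positive skewness before the transformation and a value close to zero afterwards indicates that the long right tail has been compressed.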

Correlation Matrix/Heatmap -
I then created a correlation matrix for all the numeric variables in each data set. I did this to check for collinearity between variables. If two or more variables are collinear (have a strong relationship to one another), we would need to keep one and remove the others from our model creation in order to improve model accuracy. It turns out that "Units Sold", "Unit Price" and "Unit Cost" are highly correlated with the "Total Revenue", "Total Cost" and "Total Profit" fields. Collinearity reduces the efficacy of our model. The question now is: which variables do I exclude from my analysis, "Units Sold", "Unit Price" and "Unit Cost", or "Total Revenue", "Total Cost" and "Total Profit"? The histogram analysis above makes the answer very easy. We will exclude the "Total Revenue", "Total Cost" and "Total Profit" fields since they are all right-skewed and would have a bad effect on model accuracy/efficacy.
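A minimal sketch of how highly correlated pairs could be listed programmatically is shown below. The data is synthetic: the price list, the fixed 60% cost ratio and the 0.8 threshold are assumptions chosen for illustration, and the totals are derived from units, price and cost by construction.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for the numeric columns (all values are made up).
df = pd.DataFrame({
    "Units Sold": rng.integers(1, 10001, size=n),
    "Unit Price": rng.choice([10.0, 50.0, 150.0, 400.0, 650.0], size=n),
})
df["Unit Cost"] = df["Unit Price"] * 0.6            # assumed fixed cost ratio
df["Total Revenue"] = df["Units Sold"] * df["Unit Price"]
df["Total Cost"] = df["Units Sold"] * df["Unit Cost"]
df["Total Profit"] = df["Total Revenue"] - df["Total Cost"]

corr = df.corr()

# Collect each pair of columns once and keep those above the threshold.
threshold = 0.8
high_pairs = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
}
for pair, value in high_pairs.items():
    print(pair, round(value, 3))
```

Because the totals are computed from the unit-level columns, keeping both sets of columns adds little new information, which is the reasoning behind dropping one set.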

Boxplot-
To further understand the relationship between some of the categorical and numeric variables that I consider important for model building, I created box plots for each data set. The first box plot compares the "Order Priority" and "Unit Cost" fields. I want to better understand company spending by "Order Priority" group, which will help me decide where it is best to allocate resources and money. In this box plot the "Medium" and "Low" priority groups have similar distributions: both have similar medians and are right-skewed, with "Medium" showing more variability. The "Critical" and "High" priority groups also have similar distributions, with similar medians and right skew, and with "Critical" showing more variability. This tells me that the four priority groups can be organized into two sub-groups: "Low" and "Medium", and "High" and "Critical". We can apply similar strategies within each sub-group in order to potentially cut costs. We also see that the "Low" and "Medium" sub-group is less skewed than the "Critical" and "High" sub-group, meaning it has fewer outliers, so we can create a better model and implement more effective strategies for the "Low" and "Medium" sub-group.
I then created a box plot to analyze the relationship between the "Order Priority" and "Unit Price" fields. I did this to see how price is affected by priority and where it is best to reallocate resources in order to find the best possible strategy for raising the unit price. The first thing I observed is that the "Critical", "Low" and "Medium" priorities all had a similar median as well as similar variability. The "High" priority group has more variability and is more right-skewed than the other three. What this tells me is that we can build a better model, and a clearer understanding of which strategies to use, for the "Critical", "Low" and "Medium" priorities. The "High" priority group has more outliers and fewer similarities, so it is harder to draw solid conclusions from it. However, we do need to take a closer look at the "High" priority group to understand the reasons for its high variability and outliers and how that knowledge can help improve our model going forward. Another box plot I created was between "Total Profit" and "Region". I was able to see that the "Total Profit" distribution was relatively similar across all regions besides "North America", which had higher variability and was more right-skewed than the rest of the regions. It would be interesting to research what accounts for this. For my final box plot, I compared "Total Cost" and "Region". I observed that the "Caribbean", "Europe", "Asia" and "North Africa" regions all had similar variability and were similarly right-skewed (North Africa a little more so). What this tells me is that these regions can be grouped together, and similar cost-cutting strategies can potentially be applied to them.

Bar plot - 
Next, I created a bar plot to analyze "Units Sold" by "Sales Channel" and "Order Priority". I observed that for the "Critical" and "Low" priorities more units were sold online, while for the "Medium" and "High" priorities more units were sold offline. The question I have is: why are "Critical" and "High" priority products not sold through the same channel (one sells more online and the other more offline)? To shed some light on this question, I created a "Units Sold by Item Type and Order Priority" bar plot. I wanted to see the priority of each group of items and compare it with the previous bar plot to understand why "Critical" priority orders were sold mostly online and "High" priority orders mostly offline. It looks like Cereal and Snacks were the items mostly sold as "High" priority, while Personal Care and Vegetables were mostly sold as "Critical" priority. After seeing this, it makes more sense why most of the "High" priority items and most of the "Critical" priority items were sold through different "Sales Channel" categories.
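When reading these bar plots it helps to remember that seaborn's `barplot` draws the mean of the `y` variable for each group by default, not the total. The sketch below, on a tiny made-up frame (the rows are hypothetical), shows how the group means and the group totals could be tabulated side by side with `pivot_table`.

```python
import pandas as pd

# Hypothetical orders; "C" = Critical, "H" = High.
orders = pd.DataFrame({
    "Sales Channel": ["Online", "Online", "Offline", "Offline", "Online", "Offline"],
    "Order Priority": ["C", "H", "C", "H", "C", "H"],
    "Units Sold": [100, 200, 300, 400, 500, 600],
})

# Mean is what the bar heights represent; sum is the total number of units sold.
summary = orders.pivot_table(
    index="Sales Channel",
    columns="Order Priority",
    values="Units Sold",
    aggfunc=["mean", "sum"],
)
print(summary)
```

If the two tables rank the groups differently, statements such as "more units were sold online" should be checked against the totals rather than the bar heights.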



2). Data Preparation:

In [None]:
# I will need to change the data types of several fields in each data set to simplify the analysis. The "Order Date" and "Ship Date" fields will
# be switched from the "object" to the "datetime" data type. The categorical fields: "Region","Country","Item Type","Sales Channel" and "Order Priority"
# will be changed to the "category" data type
large_sales_records["Order Date"] = pd.to_datetime(large_sales_records["Order Date"])
large_sales_records["Ship Date"] = pd.to_datetime(large_sales_records["Ship Date"])
small_sales_records["Order Date"] = pd.to_datetime(small_sales_records ["Order Date"])
small_sales_records["Ship Date"] = pd.to_datetime(small_sales_records ["Ship Date"])
small_sales_records[['Region','Country','Item Type','Sales Channel','Order Priority']] = small_sales_records[['Region','Country','Item Type','Sales Channel','Order Priority']].astype('category')
large_sales_records[['Region','Country','Item Type','Sales Channel','Order Priority']] = large_sales_records[['Region','Country','Item Type','Sales Channel','Order Priority']].astype('category')



#convert ints/floats to objects for purposes of creating a decision tree
# (the LabelEncoder step in the modeling section will then encode these values as string labels)
small_sales_records[['Units Sold','Unit Price','Unit Cost']] = small_sales_records[['Units Sold','Unit Price','Unit Cost']].astype('object')
large_sales_records[['Units Sold','Unit Price','Unit Cost']] = large_sales_records[['Units Sold','Unit Price','Unit Cost']].astype('object')



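# I will now check both data sets for missing values. It turns out that both data sets have no missing values
# print() is used so that both results are displayed, not only the last expression in the cell
print(large_sales_records.isnull().sum())
print(small_sales_records.isnull().sum())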


#split the "Ship Date" field into "Ship Month" and "Ship Day" and split the "Order Date" field into "Order Month" and "Order Day" for each data set
#for the purposes of model building ("Ship Date" and "Order Date" were already converted to datetime above)
large_sales_records['Ship Month'] = large_sales_records['Ship Date'].dt.strftime('%m')
small_sales_records['Ship Month'] = small_sales_records['Ship Date'].dt.strftime('%m')
large_sales_records['Ship Day'] = large_sales_records['Ship Date'].dt.strftime('%d')
small_sales_records['Ship Day'] = small_sales_records['Ship Date'].dt.strftime('%d')

large_sales_records['Order Month'] = large_sales_records['Order Date'].dt.strftime('%m')
small_sales_records['Order Month'] = small_sales_records['Order Date'].dt.strftime('%m')
large_sales_records['Order Day'] = large_sales_records['Order Date'].dt.strftime('%d')
small_sales_records['Order Day'] = small_sales_records['Order Date'].dt.strftime('%d')


# I will now drop the "Order ID","Order Date","Ship Date" and "Country" columns from both data sets because they are of no importance in our analysis/model building
columns_to_drop = ["Order ID", "Country", "Ship Date", "Order Date"]
large_sales_records = large_sales_records.drop(columns=columns_to_drop)
small_sales_records = small_sales_records.drop(columns=columns_to_drop)



3). Build and Evaluate Models:

In [None]:
#################### DECISION TREE MODEL ####################

#SMALL DATA SET
#Encoding the "object" and "category" data type columns to numerical values for each data set
# Iterating over all the values of each column and extract their dtypes
le=LabelEncoder()
for col in small_sales_records.columns.to_numpy():
    # Comparing if the dtype is object
    if small_sales_records[col].dtypes in ('object','category'):
    # Using LabelEncoder to do the numeric transformation
        small_sales_records[col]=le.fit_transform(small_sales_records[col].astype(str))
#retrieving the "feature" and "class" column(s)
df_small = pd.DataFrame(data=small_sales_records,columns=['Region','Item Type','Sales Channel','Units Sold','Unit Price','Unit Cost','Ship Month','Ship Day','Order Month','Order Day'])
x=['Region','Sales Channel','Item Type','Ship Month','Ship Day','Order Month','Order Day','Units Sold','Unit Price','Unit Cost']
 # Splitting the dataset into train and test
X_train_small, x_test_small, y_train_small, y_test_small = train_test_split(df_small[x], small_sales_records['Order Priority'],  random_state=0)
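# LabelEncoder assigns codes in sorted order of the labels (C=0, H=1, L=2, M=3), so the names must follow that order
target_names = ['Critical','High','Low','Medium']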
#train model
model = DecisionTreeClassifier(random_state=0, max_depth=3)
model = model.fit(X_train_small,y_train_small)

#Build decision tree
fig=plt.figure(figsize=(20,20))  # customize according to the size of your tree
_=tree.plot_tree(model, feature_names = x ,filled=True)
plt.show()

#Evaluate model
y_pred_small = model.predict(x_test_small)
print(y_pred_small)
rmse = float(format(np.sqrt(mean_squared_error(y_test_small, y_pred_small)),'.3f'))
print("\nRMSE:",rmse)
#calculate the accuracy of the model
print("Accuracy : ",accuracy_score(y_test_small, y_pred_small)*100)
#retrieve a full report
print("Report : ",  classification_report(y_test_small, y_pred_small,target_names=target_names))





#LARGE DATA SET
#Encoding the "object" and "category" data type columns to numerical values for each data set
# Iterating over all the values of each column and extract their dtypes
le=LabelEncoder()
for col in large_sales_records.columns.to_numpy():
    # Comparing if the dtype is object
    if large_sales_records[col].dtypes in ('object','category'):
    # Using LabelEncoder to do the numeric transformation
        large_sales_records[col]=le.fit_transform(large_sales_records[col].astype(str))
#retrieving the "feature" and "class" column(s)
df_large = pd.DataFrame(data=large_sales_records,columns=['Region','Item Type','Sales Channel','Units Sold','Unit Price','Unit Cost','Ship Month','Ship Day','Order Month','Order Day'])
x=['Region','Sales Channel','Item Type','Ship Month','Ship Day','Order Month','Order Day','Units Sold','Unit Price','Unit Cost']
 # Splitting the dataset into train and test
X_train_large, x_test_large, y_train_large, y_test_large = train_test_split(df_large[x], large_sales_records['Order Priority'],  random_state=0)
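# LabelEncoder assigns codes in sorted order of the labels (C=0, H=1, L=2, M=3), so the names must follow that order
target_names = ['Critical','High','Low','Medium']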
#train model
model = DecisionTreeClassifier(random_state=0, max_depth=3)
model = model.fit(X_train_large,y_train_large)

#Build decision tree
fig=plt.figure(figsize=(20,20))  # customize according to the size of your tree
_=tree.plot_tree(model, feature_names = x ,filled=True)
plt.show()

#Evaluate model
y_pred_large = model.predict(x_test_large)
print(y_pred_large)
rmse = float(format(np.sqrt(mean_squared_error(y_test_large, y_pred_large)),'.3f'))
print("\nRMSE:",rmse)
#calculate the accuracy of the model
print("Accuracy : ",accuracy_score(y_test_large, y_pred_large)*100)
#retrieve a full report
print("Report : ",  classification_report(y_test_large, y_pred_large,target_names=target_names))


#################### MULTINOMIAL LOGISTIC REGRESSION MODEL ####################

#SMALL DATA SET
le=LabelEncoder()
for col in small_sales_records.columns.to_numpy():
    # Comparing if the dtype is object
    if small_sales_records[col].dtypes in ('object','category'):
    # Using LabelEncoder to do the numeric transformation
        small_sales_records[col]=le.fit_transform(small_sales_records[col].astype(str))
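#retrieving the "feature" and "class" column(s)
x=['Region','Item Type','Sales Channel','Units Sold','Unit Price','Unit Cost','Ship Month','Ship Day','Order Month','Order Day']
# Select the feature columns directly (x is already a list, so it should not be wrapped in another list)
df_small = small_sales_records[x].copy()
df_small.fillna(0, inplace=True)
# LabelEncoder assigns codes in sorted order of the labels (C=0, H=1, L=2, M=3), so the names must follow that order
target_names = ['Critical','High','Low','Medium']
# Splitting the dataset into train (80%) and test (20%)
train_small_x, test_small_x,train_small_y, test_small_y = train_test_split(df_small[x], small_sales_records['Order Priority'], test_size = 0.2)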

#build model
log_reg_small = LogisticRegression(solver='newton-cg', multi_class='multinomial')
log_reg_small.fit(train_small_x,train_small_y)
fig, ax = plt.subplots(figsize=(6, 6))
sea.regplot(x= df_small['Region'], y= small_sales_records['Order Priority'], data=df_small, logistic= True,label="Region")
sea.regplot(x= df_small['Item Type'], y= small_sales_records['Order Priority'], data=df_small, logistic= True,label="Item Type")
sea.regplot(x= df_small['Units Sold'], y= small_sales_records['Order Priority'], data= df_small, logistic= True,label="Units Sold")
sea.regplot(x= df_small['Unit Price'], y= small_sales_records['Order Priority'], data= df_small, logistic= True,label="Unit Price")
sea.regplot(x= df_small['Unit Cost'], y= small_sales_records['Order Priority'], data=df_small, logistic= True,label="Unit Cost")
ax.set(ylabel='Order Priority', xlabel='Predictor Variables')
ax.legend()
plt.show()

#Evaluate model
pred_small_y = log_reg_small.predict(test_small_x)
print(pred_small_y)
rmse = float(format(np.sqrt(mean_squared_error(test_small_y, pred_small_y)),'.3f'))
print("\nRMSE:",rmse)
#calculate the accuracy of the model
print("Accuracy : ",accuracy_score(test_small_y, pred_small_y)*100)
#retrieve a full report
print("Report : ",  classification_report(test_small_y, pred_small_y,target_names=target_names))


#LARGE DATA SET
le=LabelEncoder()
for col in large_sales_records.columns.to_numpy():
    # Comparing if the dtype is object
    if large_sales_records[col].dtypes in ('object','category'):
    # Using LabelEncoder to do the numeric transformation
        large_sales_records[col]=le.fit_transform(large_sales_records[col].astype(str))
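#retrieving the "feature" and "class" column(s)
x=['Region','Item Type','Sales Channel','Units Sold','Unit Price','Unit Cost','Ship Month','Ship Day','Order Month','Order Day']
# Select the feature columns directly (x is already a list, so it should not be wrapped in another list)
df_large = large_sales_records[x].copy()
df_large.fillna(0, inplace=True)
# LabelEncoder assigns codes in sorted order of the labels (C=0, H=1, L=2, M=3), so the names must follow that order
target_names = ['Critical','High','Low','Medium']
# Splitting the dataset into train (80%) and test (20%)
train_large_x, test_large_x,train_large_y, test_large_y = train_test_split(df_large[x], large_sales_records['Order Priority'], test_size = 0.2)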

#build model
log_reg_large = LogisticRegression(solver='newton-cg', multi_class='multinomial')
log_reg_large.fit(train_large_x,train_large_y)
fig, ax = plt.subplots(figsize=(6, 6))
sea.regplot(x= df_large['Region'], y= large_sales_records['Order Priority'], data=df_large, logistic= True,label="Region")
sea.regplot(x= df_large['Item Type'], y= large_sales_records['Order Priority'], data=df_large, logistic= True,label="Item Type")
sea.regplot(x= df_large['Units Sold'], y= large_sales_records['Order Priority'], data= df_large, logistic= True,label="Units Sold")
sea.regplot(x= df_large['Unit Price'], y= large_sales_records['Order Priority'], data= df_large, logistic= True,label="Unit Price")
sea.regplot(x= df_large['Unit Cost'], y= large_sales_records['Order Priority'], data=df_large, logistic= True,label="Unit Cost")
ax.set(ylabel='Order Priority', xlabel='Predictor Variables')
ax.legend()
plt.show()

#Evaluate model
pred_large_y = log_reg_large.predict(test_large_x)
print(pred_large_y)
rmse = float(format(np.sqrt(mean_squared_error(test_large_y, pred_large_y)),'.3f'))
print("\nRMSE:",rmse)
#calculate the accuracy of the model
print("Accuracy : ",accuracy_score(test_large_y, pred_large_y)*100)
#retrieve a full report
print("Report : ",  classification_report(test_large_y, pred_large_y,target_names=target_names))




Essay:

For my analysis I utilized two sales data sets, which were randomly generated. One data set contains 5,000 sales records while the other contains 100,000 sales records. Each data set has the same number of features (14). The data sets contain features which are categorical ("Region", "Country", "Item Type", "Sales Channel" and "Order Priority") and features which are numerical or date-valued ("Order Date", "Order ID", "Ship Date", "Units Sold", "Unit Price", "Unit Cost", "Total Revenue", "Total Cost" and "Total Profit"). For the analysis of these data sets I decided to use the Multinomial Logistic Regression and Decision Tree machine learning algorithms. Many features could be the target of prediction in both data sets, for example "Total Revenue" based on "Region", "Item Type" and "Sales Channel". For my analysis, I would like to focus on creating models which will, with the best possible accuracy, classify and predict "Order Priority". I believe that the ability to predict the correct "Order Priority" can help boost sales: if we identify patterns, such as products with a higher unit price having a higher priority, and are able to classify orders accordingly, we can produce more of the products in those classifications and, as a consequence, generate more revenue.

There were several machine learning algorithms that we have learned so far this semester which I could choose from. These include Multinomial Logistic Regression, KNN and Decision Trees. I chose Multinomial Logistic Regression and Decision Trees. The reason I did not choose KNN for building models from these particular data sets is that sales data tends to grow, and because KNN compares every new record against the stored training records, running the algorithm takes longer and longer as the data grows. For the purpose of classifying "Order Priority", I utilized the decision tree. One benefit of decision trees is that they make no assumptions about the data. Decision trees are also very flexible in handling a variety of data types and missing values, and no significant increase in processing power is noticed as opposed to some other algorithms. One drawback is that decision trees are sensitive to changes in the data: a slight change to the input variables can completely change the model.

After running the decision tree for both the large and small data sets, I observed that the small data set tree had an accuracy of 24% with an RMSE of 1.698 and that the large data set tree had an accuracy of 24.8% with an RMSE of 1.308. It is also important to mention that the larger data set tree had much higher F1 values, which would indicate that the larger model has a more balanced performance (showing higher precision and higher recall). Overall, the decision tree model built on the larger data set is just a bit more accurate than the one built on the smaller data set. It is worth mentioning that both data sets were processed very quickly and at about the same speed, which conveys an advantage of decision trees: no significant change in processing power is noticed between running the large and the small data set.
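One way to put accuracies near 24% in context is to compare them with a no-skill baseline: with four roughly balanced classes, always predicting a single class is expected to be right about a quarter of the time. The sketch below uses synthetic features and random, balanced labels, which is an assumption and not the sales data, to show what a decision tree scores when the features carry no signal.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: 10 features with no relationship to the 4 random classes.
X = rng.normal(size=(n, 10))
y = rng.integers(0, 4, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
tree_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
tree_acc = accuracy_score(y_test, tree_model.predict(X_test))
print(f"baseline accuracy: {baseline_acc:.3f}, tree accuracy: {tree_acc:.3f}")
```

Running the same comparison on the actual train/test splits would show whether the models learn anything beyond this baseline.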

After running the multinomial logistic regression model for both the large and small data sets, I noticed that the smaller data set had an accuracy of 27.8% with an RMSE of 1.842 and the larger data set had an accuracy of 24.7% with an RMSE of 1.883. In regard to F1 scores, it looks like the larger model had higher scores. From these results we can see that the smaller model was slightly more accurate, but the difference between the predicted and actual values (RMSE) is basically the same. One important note concerns processing power. Multinomial logistic regression on the large data set takes much longer to execute than on the smaller data set. The difference in processing power is big.
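The run-time difference described above could be measured rather than observed by eye. The following is a minimal sketch using `time.perf_counter`; the data is synthetic, and the helper name `fit_seconds` and the model settings are assumptions for illustration, so the timings will not match those of the actual notebook.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def fit_seconds(model, X, y):
    """Return the wall-clock time in seconds needed to fit the model."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


rng = np.random.default_rng(0)
timings = {}
for n in (5_000, 100_000):
    # Synthetic stand-in: 10 numeric features and 4 classes.
    X = rng.normal(size=(n, 10))
    y = rng.integers(0, 4, size=n)
    timings[("decision tree", n)] = fit_seconds(
        DecisionTreeClassifier(max_depth=3, random_state=0), X, y
    )
    timings[("logistic regression", n)] = fit_seconds(
        LogisticRegression(solver="newton-cg", max_iter=200), X, y
    )

for (name, n), seconds in timings.items():
    print(f"{name:>20} n={n:>7}: {seconds:.3f} s")
```

Timing only the `fit` call keeps data loading and plotting out of the comparison, so the numbers reflect the algorithms themselves.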

In conclusion, we can see that the models built on the small and large data sets have very similar accuracy. The one noticeable difference between the large and small data sets was in processing power. While decision trees had a very similar run time with both the small and large data sets, multinomial logistic regression had a much slower run time on the large data set than on the small one. For this reason, for this business case I would prefer running the decision tree algorithm over the multinomial logistic regression algorithm. Assuming the number of features is the same in both cases, the decision tree is fast for both data sets, whereas multinomial logistic regression is very slow for the large data set. Since sales data keeps growing, performance issues are crucial, and spending extra money on infrastructure to speed up the models might prove costly and take away from the generated revenue.