# Report - Otto Group product classification

[Introduction](#section1)

[Client](#section2)

[Data and its acquisition](#section3)

[Data Exploration](#section4)

- [Cleaning](#section5)
- [Is there any class imbalance?](#section6)
- [Visualizing the data](#section7)
- [Are some features related to each other?](#section8)
- [Are the correlations statistically significant?](#section9)
- [Important features to predict the target product category](#section10)

<a id='section1'></a>
# Introduction

Imagine that you are one of the biggest e-commerce companies in the world, with subsidiaries in many countries. Millions of products are sold, and thousands are added to the product line every day. A consistent analysis of the products becomes crucial. But because the global infrastructure is so diverse, many identical products get classified differently. The quality of the product analysis therefore depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights can be generated about the product range.

<a id='section2'></a>
# Client

The client here is the Otto Group, one of the top e-commerce companies. We aim to provide the client with an algorithm that classifies the given products as accurately as possible.

<a id='section3'></a>
# Data and its acquisition

The data is provided by the client as a CSV file. Some additional test data can be used to evaluate and report how well the algorithm performs. The main data file has around 61 thousand products. Each product has 93 associated features and the correct class (product category) out of 9 classes in total. The actual meaning of the features is not available, so we treat them simply as numbers. Similarly, the 9 classes are just numbered labels without any meaningful name. Knowing the actual meaning of the features and the product categories might have helped us understand the problem better, but this is likely to be the case in many other projects where confidentiality restrictions apply.

Loading the data for modeling is a very simple step. I am using the read_csv function provided by the pandas library for Python. It loads the data directly into a pandas DataFrame (a table-like structure) that is ready for processing.
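As a minimal sketch of this loading step, the snippet below reads a tiny in-memory CSV instead of the client's file. The column names (id, feat_1 ... and target) and the sample values are assumptions made for illustration only.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the client's CSV; the column names and values
# are assumptions for illustration, not the real file contents.
sample_csv = io.StringIO(
    "id,feat_1,feat_2,feat_3,target\n"
    "1,0,3,1,Class_1\n"
    "2,2,0,0,Class_2\n"
    "3,0,1,5,Class_2\n"
)

# read_csv accepts a file path or a file-like object and returns a DataFrame.
df = pd.read_csv(sample_csv)

print(df.shape)  # (rows, columns)
print(df.head())
```

With the real data, the same call would take the path of the CSV file in place of the in-memory buffer.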

<a id='section4'></a>
# Data exploration

<a id='section5'></a>
### Cleaning

The data provided has no null or otherwise missing values. The product id field carries no useful information and is removed. For our purposes, the 93 features together identify a unique product and are sufficient. The product classes are in text form and need to be converted to numerical values so that machine learning libraries can consume them. We simply number them from 1 to 9.

So our data has 61 thousand rows with 93 feature columns named feat_1, feat_2, ..., feat_93 and 1 target column with values from 1 to 9. Here is how it looks:

<img src="df-head.png">
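The cleaning steps described above could look roughly like the sketch below. It runs on a toy frame; the column names id and target and the Class_k label format are assumptions, since the report does not spell them out.

```python
import pandas as pd

# Toy frame standing in for the raw data; "id", "target" and the "Class_k"
# label format are assumed names, not confirmed by the report.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "feat_1": [0, 2, 0, 1],
    "feat_2": [3, 0, 1, 0],
    "target": ["Class_1", "Class_2", "Class_9", "Class_2"],
})

# 1. Count missing values across the whole frame.
n_missing = int(raw.isnull().sum().sum())

# 2. Drop the identifier column, which carries no predictive information.
clean = raw.drop(columns=["id"])

# 3. Map the text labels to the integers 1..9.
label_to_int = {f"Class_{k}": k for k in range(1, 10)}
clean["target"] = clean["target"].map(label_to_int)

print(n_missing)
print(clean)
```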

<a id='section6'></a>
### Is there any class imbalance?

If there is an approximately equal number of rows for each target class, we call the classes "balanced", and "imbalanced" otherwise. The two cases call for different approaches. In general, classes with more data carry disproportionate weight in the model, while classes with less data have a negligible effect. This is not what we want: the model should represent all classes well enough to make accurate predictions. Imbalanced classes are not always a problem, though. Various algorithms deal with them internally. It also depends on how important the minority classes are. Sometimes they are extremely important, as in fraud detection, where an anomaly detection framework can also be used. In general, there are a few ways to deal with imbalanced classes:

- Use stratified sampling in the training phase
- Oversample the minority classes
- Undersample the majority classes
- At the algorithm level, adjust the class weights or change the decision threshold
- Don't use accuracy for model evaluation; use AUC or F1-score instead

For the problem we are solving here, no class is more important than the others. The class imbalance in the data might also be natural, i.e. there might simply be more products in certain classes. So we will go with stratified sampling in our modeling phase. Here is what the class distribution looks like:

<img src="class-imbalance.png" align="center">


The classes 2 and 6 have relatively more data.
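To illustrate what stratified sampling means in practice, here is a small sketch on toy data in which one class is deliberately over-represented. It uses pandas group-wise sampling as one possible approach; the report does not say which implementation is used in the modeling phase, and the 20% hold-out fraction is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy imbalanced target: class 2 is deliberately over-represented.
df = pd.DataFrame({
    "feat_1": rng.integers(0, 5, size=1000),
    "target": rng.choice([1, 2, 3], size=1000, p=[0.2, 0.6, 0.2]),
})

# Share of each class in the full data.
class_share = df["target"].value_counts(normalize=True).sort_index()

# Stratified hold-out: sample 20% of the rows within each class,
# so the hold-out set keeps the original class proportions.
test = df.groupby("target").sample(frac=0.2, random_state=0)
train = df.drop(test.index)

test_share = test["target"].value_counts(normalize=True).sort_index()

print(pd.DataFrame({"full": class_share, "test": test_share}))
```

Because the sampling happens per class, even the smallest class keeps its share in both the training and the hold-out set.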


<a id='section7'></a>
### Visualizing the data

Since the data set is large, simply reading through it doesn't help. The following visualizations help in understanding the data better.

##### Fig 1. Feature value counts

- 93 features presented as different colors
- The sharp jerky lines show that the feature values are integers
- Most feature values are less than 80
- Most feature values are concentrated in 0-5, hence the log of the value counts on the y axis

<img src="feature-counts-all.png" align="left">



##### Fig 2. Feature values and their target classes

- Shows the range of the values each feature can take
- Each feature has 9 separate rows, one for each target class
- Two features with extreme values are shown separately
- The intensity of the colors shows the concentration of that feature value

<img src="all-feature-values.png" align="left">
<img src="extreme-feature-values.png" align="left">

##### Fig 3. All 60K feature rows and their target classes

- The width of the colored bars shows the class imbalances
- Shows classes that have extreme feature data values
- Feature data points overlap on the chart

<img src="scatter-feature-vs-target.png" align="left">


<a id='section8'></a>
### Are some features related to each other?

Knowing whether some features are related to each other helps. If two features are strongly linearly correlated, one of them can be removed to produce a simpler ML model of the relation between the features and the target.

To measure the correlation between two features, we can compute their Pearson correlation coefficient. A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 means there is no linear relationship between the two features. We have 93 features, and we need to find the correlation between all pairs.
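A minimal sketch of the all-pairs computation, assuming the features sit in a pandas DataFrame. The three toy columns below are hypothetical: feat_2 is built from feat_1 so that this pair comes out strongly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.poisson(2.0, size=500)

# Toy data: feat_2 is built from feat_1, so that pair is strongly correlated;
# feat_3 is independent noise.
df = pd.DataFrame({
    "feat_1": base,
    "feat_2": base + rng.poisson(0.3, size=500),
    "feat_3": rng.poisson(2.0, size=500),
})

# Pearson correlation for every pair of columns (a symmetric matrix).
corr = df.corr(method="pearson")

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
top_pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(top_pairs)
```

Ranking by absolute value keeps strong negative correlations near the top as well, since they are just as redundant as strong positive ones.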

##### Fig 1. Correlation Matrix

- Feature-by-feature matrix
- The color bar shows the range of the correlation coefficient
- Darker squares indicate higher correlation
- Most high correlations are positive, meaning the two feature values increase or decrease together

<img src="correlation.png" align="left">



##### Fig 2. Highest correlation feature pairs

- [(8, 36), (39, 45), (3, 46), (3, 54), (9, 64), (15, 72), (29, 77), (30, 84)]

<img src="corr-feat-pairs.png" align="left">


<a id='section9'></a>
### Are the correlations statistically significant?

Formally, the correlation coefficient, r, tells us about the strength and direction of the linear relationship between x and y. However, the reliability of the linear model also depends on how many observed data points are in the sample. We need to look at both the value of the correlation coefficient r and the sample size n, together. The hypothesis test lets us decide whether the value of the population correlation coefficient 𝛒 is "close to zero" or "significantly different from zero". We decide this based on the sample correlation coefficient r and the sample size n.

Null Hypothesis: H0: 𝛒 = 0

Alternate Hypothesis: Ha: 𝛒 ≠ 0

**OR**

Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between x and y in the population.

Alternate Hypothesis Ha: The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

**at significance level (α = 0.01)**

We will reject or fail to reject the null hypothesis based on the p-values: if the p-value is below α, we reject H0. We apply this test to all the feature pairs to find the most highly correlated features, and we can compare the outcome with our previous results.

**Result: The p-values were reported as 0 (too small to distinguish from zero numerically), so we rejected the null hypothesis that the population correlation is zero. The pairs of features <font color=blue>[(8, 36), (39, 45), (3, 46), (3, 54), (9, 64), (15, 72), (29, 77), (30, 84)] </font>are significantly linearly correlated at the 1% significance level.**
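The test described in this section can be sketched with scipy.stats.pearsonr, which returns both the sample correlation r and the two-sided p-value for H0: ρ = 0. The data below is synthetic and the variable names are hypothetical; whether the original analysis used this exact function is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
x = rng.poisson(2.0, size=n).astype(float)
y_related = x + rng.normal(0.0, 1.0, size=n)  # linearly related to x
y_unrelated = rng.normal(0.0, 1.0, size=n)    # generated independently of x

alpha = 0.01  # significance level used in the report

# pearsonr returns the sample correlation r and the two-sided p-value
# for the null hypothesis that the population correlation is zero.
r_rel, p_rel = stats.pearsonr(x, y_related)
r_un, p_un = stats.pearsonr(x, y_unrelated)

for name, r, p in [("related", r_rel, p_rel), ("unrelated", r_un, p_un)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: r = {r:.3f}, p = {p:.3g} -> {decision}")
```

With tens of thousands of rows, even moderate correlations produce p-values small enough to be printed as 0, so the size of r remains the more informative number.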

<a id='section10'></a>
### Important features to predict the target product category

For a given feature, if the difference between the classes is significant, then that feature is important and should be used when searching for the best model.

To find the between-class difference for each feature we can use ANOVA (analysis of variance). ANOVA can determine whether the means of three or more groups are different. It uses F-tests to statistically test the equality of means. However, before we jump into ANOVA, there are certain assumptions that should be satisfied:

- The different populations should have the same variance. This is called the assumption of homogeneity of variance
- The different populations should be normally distributed
- Samples are independent

The third assumption appears to be satisfied because there is no evidence against it. The first two assumptions can be relaxed a bit. Here is a visualization of the data. The spread of the data is roughly normal (with some skew), and the classes have similar spread/variance.

##### Fig 1. Features data spread for each class (1 to 9)

<img src="../data-storytelling/kde1.png" align="left">
<img src="../data-storytelling/kde2.png" align="left">
<img src="../data-storytelling/kde3.png" align="left">
<img src="../data-storytelling/kde4.png" align="left">
<img src="../data-storytelling/kde5.png" align="left">
<img src="../data-storytelling/kde6.png" align="left">
<img src="../data-storytelling/kde7.png" align="left">
<img src="../data-storytelling/kde8.png" align="left">
<img src="../data-storytelling/kde9.png" align="left">
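The report judges the equal-variance assumption visually from the plots above. As an optional numerical complement (not part of the original analysis), Levene's test from scipy.stats could be applied to the per-class values of a feature. The sketch below uses synthetic groups, not the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy per-class samples of a single feature (not the real data).
similar = [rng.normal(0.0, 1.0, size=300) for _ in range(3)]
# The same three groups plus a fourth with three times the standard deviation.
mixed = similar + [rng.normal(0.0, 3.0, size=300)]

# Levene's test: H0 is that all groups share the same variance.
stat_similar, p_similar = stats.levene(*similar, center="median")
stat_mixed, p_mixed = stats.levene(*mixed, center="median")

print(f"similar spreads: W = {stat_similar:.2f}, p = {p_similar:.3g}")
print(f"mixed spreads:   W = {stat_mixed:.2f}, p = {p_mixed:.3g}")
```

A small p-value here would indicate that the variances differ between classes, in which case the ANOVA results should be read with more caution.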



##### ANOVA and results

After applying ANOVA to each feature, the following came up as the important features:
<font color="blue">feat_34, feat_11, feat_14, feat_60, feat_25</font>

Less important features:
<font color="blue">feat_65, feat_6, feat_51, feat_63, feat_12</font>
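A minimal sketch of a per-feature one-way ANOVA, assuming scipy.stats.f_oneway. The report does not state which implementation or ranking criterion produced the lists above, so ranking by the F statistic is an assumption here, and the columns feat_a and feat_b are hypothetical toy features.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n_per_class = 200
target = np.repeat([1, 2, 3], n_per_class)

# Toy data: feat_a has a different mean in every class, feat_b does not.
df = pd.DataFrame({
    "feat_a": rng.normal(loc=target.astype(float), scale=1.0),
    "feat_b": rng.normal(loc=0.0, scale=1.0, size=target.size),
    "target": target,
})

results = {}
for feature in ["feat_a", "feat_b"]:
    # One array of feature values per class.
    groups = [g[feature].to_numpy() for _, g in df.groupby("target")]
    f_stat, p_value = stats.f_oneway(*groups)
    results[feature] = (f_stat, p_value)

# A larger F statistic means the class means differ more
# relative to the spread within each class.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)
print(ranking)
```

Applied to the real data, the loop would run over all 93 feature columns and the 9 target classes.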


##### Fig. Kernel density plots for top features
- The top features clearly differentiate between a few classes
- For example, the class 5 data is strikingly different for feature 34

<img src="top-feat-kdes.png" align="left">


##### Fig. Tukey plots to visualize further

<img src="tukey.png" align="left">
