# Feature Analysis

In [None]:
import numpy as np
import pandas as pd

from scipy.stats import chi2_contingency

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
raw_df = pd.read_csv("/kaggle/input/customer-analytics/Train.csv")
raw_df.head()

## Individual Relations of Shipment Factors with Success
### Correlations with Numerical Factors

In [None]:
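# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate.
raw_df.corr(numeric_only=True)['Reached.on.Time_Y.N'].sort_values().reset_index()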

Individual correlations of success with the other factors are generally close to zero. However, they can still give us some insight.

- __ID__: The strongest negative correlation. On its own this means nothing, since ID is an identifier rather than a quantity. However, if IDs are assigned in chronological order, we may conclude that service quality declines over time.
- __Weight_in_gms__: A negative correlation is reasonable, since heavier products would be harder to ship.
- __Cost_of_the_Product__: Very weak negative correlation. It might be due to a correlation between cost and weight of the product, which has not been calculated yet.
- __Customer_care_calls__: Very weak negative correlation. Problems with a shipment may require more calls.
- __Prior_purchases__: Very weak negative correlation. Customer acquisition might be the main strategy rather than customer retention. However, the magnitude of the correlation is too low to draw a firm conclusion.
- __Customer_rating__: No correlation. Personally, I would expect a positive correlation.
- __Discount_offered__: Positive correlation. Probably, high discount rates are offered to more important customers, whose shipments are prioritized to be completed on time.
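
Because the success column is binary, each Pearson coefficient above is numerically the same as a point-biserial correlation, and `scipy.stats.pointbiserialr` also reports a p-value that helps judge whether a weak coefficient is distinguishable from zero. A minimal sketch on synthetic data (the arrays `outcome` and `feature` are made-up stand-ins, not the shipment data):

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Synthetic stand-in data (assumption): a binary outcome and one numeric feature.
rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=500)           # 0/1 target
feature = rng.normal(size=500) + 0.3 * outcome   # weakly related numeric feature

# Pearson correlation between the feature and the 0/1 outcome.
r_pearson, _ = pearsonr(feature, outcome)

# Point-biserial correlation: binary variable first, numeric variable second.
r_pb, p_value = pointbiserialr(outcome, feature)

print(f"Pearson r:        {r_pearson:.4f}")
print(f"Point-biserial r: {r_pb:.4f} (p = {p_value:.4g})")
```

The two coefficients agree up to floating-point error, so the p-value can be read as a significance check for the corresponding entry of the correlation table.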

### Categorical Relations

We cannot use correlations to describe relations with categorical variables. But for any given categorical feature, we can check whether all of its groups have a similar success rate. If they do, we can say that this categorical feature shows no effect on shipment success.

For this purpose, I used the Chi-square test of independence between two categorical variables.
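
To make the idea concrete before applying it to the real data, here is a minimal sketch on a toy table, built with `pd.crosstab` as one possible way to get the contingency table. The frame `toy` and its columns `block` and `on_time` are assumptions for illustration. Every group has exactly the same success rate, so the test should find no evidence against independence.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data (assumption): a categorical feature and a binary outcome,
# constructed so that every group has a 50% success rate.
toy = pd.DataFrame({
    "block":   ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"] * 10,
    "on_time": [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0] * 10,
})

# Rows are outcomes, columns are groups, cells are counts.
table = pd.crosstab(toy["on_time"], toy["block"])

# chi2_contingency compares the observed counts with the counts expected
# under independence; a large p-value means no evidence of a relation.
stat, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2 = {stat:.4f}, dof = {dof}, p = {p:.4f}")
```

The degrees of freedom are (rows - 1) x (columns - 1), so a 2 x 3 table has 2.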

In [None]:
contingency_table = raw_df.groupby(['Warehouse_block','Reached.on.Time_Y.N']).size().reset_index() \
    .rename(columns={0:'Counts'}).pivot(columns='Warehouse_block',index='Reached.on.Time_Y.N')
print(contingency_table)

stat, p, dof, expected = chi2_contingency(contingency_table)
print(f'\nP value for independence: {p:.4f}')

The p-value is far from zero, so we cannot reject independence: **there is no evidence of a relation between warehouse block and shipment success**. You can reach the same conclusion just by looking at the contingency table. The success ratio is similar for every warehouse block (about 60%).

Let's do the same analysis for the other variables.

In [None]:
other_cat_cols = ['Mode_of_Shipment','Product_importance','Gender']
for cat_col in other_cat_cols:
    contingency_table = raw_df.groupby([cat_col,'Reached.on.Time_Y.N']).size().reset_index() \
        .rename(columns={0:'Counts'}).pivot(columns=cat_col,index='Reached.on.Time_Y.N')
    print(contingency_table)

    stat, p, dof, expected = chi2_contingency(contingency_table)
    
    print(f'\nP value for independence: {p:.4f}\n\n')

Shipment mode and gender seem unrelated to shipment success. However, **shipment success is not independent of product importance**. Let's look at the success rates by product importance.

In [None]:
order = {'high':0,'medium':1,'low':2}
contingency_table = raw_df.groupby(['Product_importance','Reached.on.Time_Y.N']).size().reset_index() \
        .pivot(columns='Product_importance',index='Reached.on.Time_Y.N')
success_ratios = contingency_table.div(contingency_table.sum(0), axis=1).iloc[1].reset_index().iloc[:,1:]
success_ratios = success_ratios.sort_values('Product_importance',key = lambda col: col.map(order)).rename(columns={1:'Success Ratio'})
success_ratios

It seems clear that **high-importance products have a better shipment success rate**. This is reasonable.
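
The same per-group success rate can also be obtained more directly, because the mean of a 0/1 column is the share of ones. A minimal sketch on toy data (the frame `toy` and its columns `importance` and `on_time` are hypothetical stand-ins), using an ordered categorical to control the row order:

```python
import pandas as pd

# Toy data (assumption); column names are hypothetical stand-ins.
toy = pd.DataFrame({
    "importance": ["low", "low", "medium", "medium", "high", "high", "high", "low"],
    "on_time":    [1, 0, 1, 0, 1, 1, 0, 0],
})

# An ordered categorical makes groupby return the groups in this order
# instead of alphabetically.
toy["importance"] = pd.Categorical(
    toy["importance"], categories=["high", "medium", "low"], ordered=True
)

# The mean of a 0/1 column is the share of ones in each group.
success_ratio = (
    toy.groupby("importance", observed=True)["on_time"]
    .mean()
    .rename("Success Ratio")
)
print(success_ratio)
```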

### Further Analysis of Some Features

We already looked at the correlations for these features, but we did not take their distributions into account. A scatter plot with a regression line would normally be a good choice. However, since our output value is binary, trying something different might be useful.

I will split each continuous feature into equally sized chunks and look at the success rate of each chunk.

In [None]:
columns_to_categorize = ['Discount_offered','Cost_of_the_Product','Weight_in_gms']

In [None]:
all_categorical = raw_df.copy()
n_chunks = 10
for col in columns_to_categorize:
    # Sort by the feature so that the row position is its rank.
    all_categorical = all_categorical.sort_values(col).reset_index(drop=True)
    # Scale the rank to [0, n_chunks) and take the integer part as the chunk number.
    all_categorical[col] = all_categorical.index
    all_categorical[col] = all_categorical[col]/all_categorical[col].max()*(n_chunks-.0001)
    all_categorical[col] = "Chunk " + (all_categorical[col]//1).astype(int).astype(str)
all_categorical

In [None]:
all_categorical.groupby(['Discount_offered'])['Reached.on.Time_Y.N'].mean().plot()
plt.title("Discount Rate (Chunked) vs. Success")
plt.show()

all_categorical.groupby(['Cost_of_the_Product'])['Reached.on.Time_Y.N'].mean().plot()
plt.title("Cost of Product (Chunked) vs. Success")
plt.show()

all_categorical.groupby(['Weight_in_gms'])['Reached.on.Time_Y.N'].mean().plot()
plt.title("Weight of Product (Chunked) vs. Success")
plt.show()

I cut each of these three features into 10 equally sized chunks, shown on the X axis. The Y axis shows the mean success ratio of each chunk.

- **Discount_offered**: *Chunk 0* holds the 10% of shipments with the lowest discounts, while *Chunk 9* holds the 10% with the highest. We already know that the discount is positively correlated with success, but the effect is only visible in the last two chunks.

- **Cost_of_the_Product**: *Chunk 0* holds the 10% of shipments with the lowest costs. The graph slopes downward, which means that high-cost groups carry more risk of shipment failure. It is hard to find a direct reason for this. Equal chunk sizes might give a misleading view.

- **Weight_in_gms**: It seems that lighter products are easier to ship. Chunk 3 has 100% success, which is interesting.
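
The rank-based chunking used above can also be written with `pd.qcut` applied to ranks. The sketch below uses synthetic data (the frame `toy` and its columns `weight` and `on_time` are assumptions) and also reports the value range of each chunk, which helps when judging whether equal chunk sizes are misleading.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): one numeric feature and a 0/1 outcome.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "weight":  rng.integers(1000, 6000, size=1000),
    "on_time": rng.integers(0, 2, size=1000),
})

n_chunks = 10

# Ranking with method="first" breaks ties, so qcut can always form
# n_chunks bins that hold the same number of rows.
ranks = toy["weight"].rank(method="first")
labels = [f"Chunk {i}" for i in range(n_chunks)]
toy["chunk"] = pd.qcut(ranks, q=n_chunks, labels=labels)

# Size, success rate and value range of every chunk.
chunk_summary = toy.groupby("chunk", observed=True).agg(
    n=("on_time", "size"),
    success_rate=("on_time", "mean"),
    min_weight=("weight", "min"),
    max_weight=("weight", "max"),
)
print(chunk_summary)
```

Every chunk holds the same number of rows, while `min_weight` and `max_weight` show how wide a value range each chunk actually spans.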

To understand data, it is always better to look at it from several angles. So far we investigated the data in equally sized chunks. This has a disadvantage that may mislead us: each chunk covers a different value range. So I will also plot histograms of these features by failure and success:

In [None]:

# Keep the three features plus the target; copy to avoid SettingWithCopyWarning.
for_histogram = raw_df[columns_to_categorize + ['Reached.on.Time_Y.N']].copy()
for_histogram['Outcome'] = for_histogram['Reached.on.Time_Y.N'].map({0:'Failure',1:'Success'})

# Duplicate every row under the label 'All' to get a third panel with the overall distribution.
for_histogram_all = for_histogram.copy()
for_histogram_all['Outcome'] = 'All'

for_histogram = pd.concat([for_histogram, for_histogram_all], axis=0, ignore_index=True)
for c in columns_to_categorize:
    g = sns.FacetGrid(for_histogram, col="Outcome")
    g.map(sns.histplot, c, bins=25)
    plt.show()

In the plots, the first column shows successful shipments and the second shows failed shipments. The third column shows the distribution of all shipments.

- **Discount_offered**: The data is right-skewed and there is no failure above a certain discount rate. This suggests that higher discounts are exceptional and reserved for prime customers. This is consistent with the line plot of chunks: the last two chunks probably cover a wide range of discount values.

- **Cost_of_the_Product**: Successes and failures have similar distributions. However, as the cost increases, the ratio of successes to failures goes down slightly. That is the cause of the downward slope in the line plot of chunks above.

- **Weight_in_gms**: The data has a bimodal distribution with interesting cut-off points. There is no failure across a large range of weights between 2000 and 4000. In the same range, the number of successful shipments is also slightly lower.
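
The changing ratio of successes to failures described for Cost_of_the_Product can also be computed directly as a success rate inside equal-width bins, which mirror the histogram bins rather than the equal-size chunks. A minimal sketch on synthetic data (the frame `toy`, its columns `cost` and `on_time`, and the bin count are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption), not the shipment dataset.
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "cost":    rng.integers(100, 300, size=2000),
    "on_time": rng.integers(0, 2, size=2000),
})

# pd.cut with an integer `bins` splits the observed range into
# equal-width intervals, like the bins of a histogram.
toy["cost_bin"] = pd.cut(toy["cost"], bins=5)

# Row count and share of successes in every interval.
rate_by_bin = toy.groupby("cost_bin", observed=False)["on_time"].agg(["size", "mean"])
rate_by_bin.columns = ["n_shipments", "success_rate"]
print(rate_by_bin)
```

Unlike the chunks, these bins cover equal value ranges but hold different numbers of shipments, so `n_shipments` should be read together with `success_rate`.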

# Abnormalities

### 1. ID vs Success Rate

In [None]:
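# 100-row moving average of the target; rows are assumed to be ordered by ID.
ma_success = raw_df['Reached.on.Time_Y.N'].rolling(window=100, min_periods=1).mean()
ma_success.plot()
plt.title('Moving Average of Success Rate Ordered by ID')
plt.show()

# Zero-based row position of the first failure, which equals the number of successes before it.
first_fail = np.where(ma_success < 1)[0][0]
print(f'Number of consecutive successes before the first failure: {first_fail}')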

When shipments are ordered by ID, **there is no shipment failure until the 3135th ID**. Success looks random after that ID. The reason might be one of these:
1. **There was no failure for a long time at the beginning**. This might be plausible if we assume that ID is chronological and the company's workload was lower at the beginning. But it is still a strange pattern, since there is NO failure for a long time and then success suddenly turns random.
2. **The consecutive successes occurred by chance**. You can check it yourself. It is not plausible for random data that about 30% of its entries are ordered like that!
3. **A change occurred in the data entry process**. This is possible because of different data sources, different personnel, etc.
4. **The first entries are faulty**. Personally, this seems the most likely reason to me.

Therefore, I recommend trying your prediction model with and without this section of the data. The results might differ.
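
Reason 2 can be checked with a short calculation. The sketch below assumes, purely for illustration, that shipments are independent and that each one succeeds with probability 0.6, the approximate overall rate seen in the contingency tables. It then computes the probability that the first 3135 shipments all succeed.

```python
import math

# Assumptions for illustration: shipments are independent and each one
# succeeds with probability 0.6 (the approximate overall rate seen earlier).
p_success = 0.6
run_length = 3135  # number of consecutive successes at the start of the data

# Work in log space: 0.6 ** 3135 underflows to 0.0 in floating point.
log10_prob = run_length * math.log10(p_success)

print(f"log10 of P(run of {run_length} successes) = {log10_prob:.1f}")
```

This is the probability for one fixed starting position. Allowing the run to start at any row multiplies it by at most the number of rows, which does not change the conclusion that chance alone is not a credible explanation.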

### 2. Distribution of Product Weight, Cost, and Importance

In [None]:
sns.scatterplot(data = raw_df,x='Weight_in_gms',y='Cost_of_the_Product')
plt.title("Weight of Product vs. Cost of Product")
plt.show()

The weight vs. cost graph looks very 'synthetic'. **There are groups with sharp boundaries and uniform distributions, which does not seem natural**. Let's add product importance to the current features.

In [None]:
sns.scatterplot(data = raw_df,x='Weight_in_gms',y='Cost_of_the_Product',hue = 'Product_importance')
plt.title("Weight of Product vs. Cost of Product")
plt.show()

I would expect the cost of a product and its importance to be related. However, **product importance seems to be distributed randomly**. Maybe product importance reflects something related to customer importance or the hazardousness of the product.

### 3. Distribution of Product Weight for Failed Shipments

In [None]:
g = sns.FacetGrid(raw_df, col="Reached.on.Time_Y.N")
g.map(sns.histplot, 'Weight_in_gms')
plt.show()

For failed shipments, the distribution of weights looks unnatural. **There is no failure between 2,000 g and 4,000 g, while below 2,000 g and above 4,000 g the pattern is uniform**.

You can see the same gap in the cost vs. weight graph:

In [None]:
sns.scatterplot(data = raw_df,x='Weight_in_gms',y='Cost_of_the_Product')
plt.title("Weight of Product vs. Cost of Product")
plt.show()

In [None]:
sns.scatterplot(data = raw_df,x='ID',y='Weight_in_gms',hue='Reached.on.Time_Y.N')
plt.title("ID vs. Weight of Product")
plt.show()

This last one is really interesting. It combines our two findings. **We have totally different data before and after ID 3135**. Before that ID, product weights are uniformly distributed (except for outliers) between 1000 and 4000. And interestingly, all of those shipments were made on time! After that ID, we have two distinct weight groups that are totally different from the earlier distribution, and success is random across IDs and weights. It looks as if data from two different companies were combined.
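
One way to quantify this difference is to label each row by the side of the boundary it falls on and compare summary statistics per segment. The sketch below runs on synthetic data that only mimics the pattern described above. The frame `toy`, its columns, and the boundary of 300 are assumptions for illustration (the notebook's data would use ID 3135).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): an early block with mid-range
# weights and no failures, then a later block with two weight groups
# and mixed outcomes.
rng = np.random.default_rng(1)
boundary = 300  # hypothetical boundary; the notebook's data uses ID 3135
early = pd.DataFrame({
    "weight":  rng.integers(1000, 4000, size=boundary),
    "on_time": 1,
})
late = pd.DataFrame({
    "weight": np.concatenate([
        rng.integers(1000, 2000, size=350),
        rng.integers(4000, 6000, size=350),
    ]),
    "on_time": rng.integers(0, 2, size=700),
})
toy = pd.concat([early, late], ignore_index=True)
toy["ID"] = toy.index + 1

# Label each row by its side of the boundary, then compare the segments.
toy["segment"] = np.where(toy["ID"] <= boundary, "early", "late")
segment_summary = toy.groupby("segment").agg(
    n=("ID", "size"),
    success_rate=("on_time", "mean"),
    min_weight=("weight", "min"),
    max_weight=("weight", "max"),
)
print(segment_summary)
```

The same `segment` column can be used to filter the training data when fitting a model with and without the early block.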