# Auto Sales Data Analysis and Evaluation

# Main Question:

Why do some countries have higher auto sales than others, and what factors (e.g., price, quantity, product line, and customer characteristics) contribute to these differences?

# Sub Questions:
- Does quantity ordered significantly relate to total sales across countries or customer types?
- Can we predict whether a product will sell based on features like price, product line, and deal size?
- Can we use the number of days since each customer's last order to analyze customer purchasing patterns?

In [None]:
import os

import numpy as np
import pandas as pd

import plotly.express as px

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")


In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/auto-sales-data/Auto Sales data.csv", index_col=0)

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
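# Aggregate first so each bar segment is one city's total rather than one order line.
sales_country = df.groupby(["COUNTRY", "CITY"], as_index=False)["SALES"].sum()
fig1 = px.bar(sales_country,
              x="COUNTRY",
              y="SALES",
              color="CITY",
              title="Total Sales per Country",
              )
fig1.update_layout(title_x=0.5)
fig1.show()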

In [None]:
# The plot shows that the United States leads in total sales, followed by Spain and France. Total sales to U.S. customers are far higher than to any other country in the dataset.
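To put a number on how far ahead the leading country is, one option is to express each country's total as a share of overall sales. The sketch below uses a small made-up table (the `toy` frame and its values are illustrative, not taken from the dataset); with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "COUNTRY": ["USA", "USA", "Spain", "France", "France"],
    "SALES": [4000.0, 6000.0, 5000.0, 2000.0, 3000.0],
})

# Total sales per country, largest first.
country_totals = toy.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False)

# Each country's share of overall sales, in percent.
country_share = (country_totals / country_totals.sum() * 100).round(1)
print(country_share)
```

A share table like this makes it easier to say whether one country dominates or merely leads by a small margin.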

In [None]:
fig2 = px.scatter(df,
                x="QUANTITYORDERED",
                y="SALES",
                color="PRODUCTLINE",
                title="QUANTITY ORDERED vs SALES",
                )
fig2.update_layout(title_x=0.5)
fig2.show()

In [None]:
# The scatter plot shows that most orders fall between 20 and 40 units, with corresponding sales values mostly between $2,000 and $10,000. This suggests a typical order volume and sales range for most products. Additionally, by observing the color-coded clusters, I can identify which product lines (e.g., Vintage Cars, Classic Cars, Motorcycles, etc.) are more frequently sold within this high-activity range.
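The "most orders fall between 20 and 40 units" reading comes from eyeballing the scatter plot, so it may be worth checking numerically. A minimal sketch, assuming a toy frame with made-up values in place of `df`:

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "QUANTITYORDERED": [15, 22, 30, 38, 40, 45, 27, 50],
    "SALES": [1500.0, 2300.0, 3100.0, 4200.0, 4100.0, 9000.0, 2600.0, 11000.0],
})

# Boolean mask for the 20-40 unit band (Series.between is inclusive on both ends by default).
in_band = toy["QUANTITYORDERED"].between(20, 40)

# Fraction of order lines that fall inside the band.
band_share = in_band.mean()

# Range covering the middle 90% of sales values inside the band.
band_sales_range = toy.loc[in_band, "SALES"].quantile([0.05, 0.95])

print(f"Share of orders with 20-40 units: {band_share:.1%}")
print(band_sales_range)
```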

In [None]:
filtered_df = df[(df['QUANTITYORDERED'] >= 20) & (df['QUANTITYORDERED'] <= 40)]
product_counts = filtered_df['PRODUCTLINE'].value_counts().reset_index()
product_counts.columns = ['PRODUCTLINE', 'OrderCount']

print(product_counts)

In [None]:
fig3 = px.bar(product_counts, x='PRODUCTLINE', y='OrderCount',
             title='Most Sold Product Lines (Quantity Ordered Between 20 and 40)',
             labels={'OrderCount': 'Number of Orders'})
fig3.update_layout(title_x=0.5)
fig3.show()

In [None]:
# The most frequently ordered product line in this quantity range is Classic Cars.

In [None]:
fig4 = px.box(df, x='DEALSIZE', y='PRICEEACH', title='Price Distribution per Deal Size')
fig4.update_layout(title_x=0.5)
fig4.show()

In [None]:
# The plot shows that large deals are typically made for products priced between 100 and 250. This suggests that higher-priced items are more common in large-volume purchases, though it doesn’t necessarily mean they are sold more frequently.
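The box plot summarizes quartiles visually; the same numbers can be tabulated per deal size for a more exact comparison. The sketch below uses a made-up `toy` frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "DEALSIZE": ["Small", "Small", "Medium", "Medium", "Large", "Large"],
    "PRICEEACH": [40.0, 60.0, 90.0, 110.0, 150.0, 210.0],
})

# Quartiles of unit price within each deal size: one row per deal size,
# one column per quantile (0.25, 0.5, 0.75).
price_by_deal = (
    toy.groupby("DEALSIZE")["PRICEEACH"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(price_by_deal)
```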

In [None]:
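# Series.mode can return several values on ties; keep the first so each cell holds one label.
def first_mode(s):
    return s.mode().iloc[0]

country_summary = df.groupby('COUNTRY').agg({
    'SALES': 'sum',
    'QUANTITYORDERED': 'mean',
    'PRICEEACH': 'mean',
    'DEALSIZE': first_mode,
    'PRODUCTLINE': first_mode,
}).reset_index()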

print(country_summary)

The USA leads in total sales, which appears to be driven by a combination of:

- Higher average order quantity (36 units/order)
- Higher average price per unit (~115)
- Predominantly large deal sizes
- Strong performance in the “Classic Cars” product line
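The mode only reports the single most common deal size per country. To see how strongly one deal size dominates, a row-normalized cross-tabulation is one option. This is a sketch on made-up data; the `toy` frame is hypothetical:

```python
import pandas as pd

# Hypothetical toy data standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "COUNTRY": ["USA", "USA", "USA", "USA", "Spain", "Spain"],
    "DEALSIZE": ["Medium", "Medium", "Small", "Large", "Small", "Medium"],
})

# Share of each deal size within every country (each row sums to 1).
deal_mix = pd.crosstab(toy["COUNTRY"], toy["DEALSIZE"], normalize="index")
print(deal_mix.round(2))
```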

In [None]:
fig5 = px.histogram(df, x="DAYS_SINCE_LASTORDER", nbins=20,
                    title="Distribution of Days Since Last Order",
                    color="DEALSIZE")
fig5.update_layout(title_x=0.5)
fig5.show()

In [None]:
# trendline="ols" requires the statsmodels package to be installed.
fig6 = px.scatter(df,
                  x="DAYS_SINCE_LASTORDER",
                  y="SALES",
                  trendline="ols",
                  title="Relationship Between Days Since Last Order and Sales")
fig6.update_layout(title_x=0.5)
fig6.show()

In [None]:
# As the number of days since the last order increases, sales tend to decrease.

In [None]:
print(df[['DAYS_SINCE_LASTORDER', 'SALES']].corr())

In [None]:
# A moderate negative correlation is observed, which means the longer it has been since a customer placed an order, the lower the sales tend to be.
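For reference, the value reported by `.corr()` is the Pearson coefficient. Its square gives the share of variance in sales that a straight-line fit on days since last order would account for, which helps when judging how strong the relationship is. The sketch below reproduces the calculation by hand on made-up numbers (the `toy` values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    "DAYS_SINCE_LASTORDER": [100, 400, 800, 1200, 1600, 2000],
    "SALES": [5200.0, 4800.0, 4100.0, 3900.0, 3000.0, 2500.0],
})

x = toy["DAYS_SINCE_LASTORDER"].to_numpy(dtype=float)
y = toy["SALES"].to_numpy(dtype=float)

# Pearson r: sum of products of the centred values, divided by the
# square root of the product of their sums of squares.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# The same quantity via pandas, plus r squared.
r_pandas = toy["DAYS_SINCE_LASTORDER"].corr(toy["SALES"])
r_squared = r_manual ** 2

print(f"r = {r_manual:.3f}, r^2 = {r_squared:.3f}")
```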

In [None]:
# Label an order line as high sales (1) when its SALES value is above the dataset median.
threshold = df['SALES'].median()
df['HIGH_SALES'] = (df['SALES'] > threshold).astype(int)

In [None]:
df_encoded = pd.get_dummies(df[['DEALSIZE', 'PRODUCTLINE']], drop_first=True)

X = pd.concat([df[['PRICEEACH', 'QUANTITYORDERED']], df_encoded], axis=1)
y = df['HIGH_SALES']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
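An accuracy figure is easier to interpret next to a trivial baseline and a confusion matrix. The sketch below shows one way to do that with scikit-learn's `DummyClassifier`. It runs on synthetic data, and the assumption that sales equal price times quantity applies to the toy data only; in the notebook, `X` and `y` would take the place of `toy` and `toy_y`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data; the real notebook would use X and y instead.
rng = np.random.default_rng(42)
n_rows = 400
toy = pd.DataFrame({
    "PRICEEACH": rng.uniform(30, 250, n_rows),
    "QUANTITYORDERED": rng.integers(10, 60, n_rows),
})
# Assumption for the toy data only: sales are price times quantity.
toy_sales = toy["PRICEEACH"] * toy["QUANTITYORDERED"]
toy_y = (toy_sales > toy_sales.median()).astype(int)

# Stratify so both classes appear in the same proportion in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, clf.predict(X_te))
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))

print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Model accuracy:    {model_acc:.2f}")
print(cm)
```

Because the label is a median split, the baseline sits near 50%, so the gap between the two numbers shows how much the features actually contribute.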

In [None]:
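# Build one example row with the same columns as X. Dummy columns must be 0 or 1;
# leaving them all at 0 selects the baseline (dropped) deal size and product line.
sample_data = pd.DataFrame(0.0, index=[0], columns=X.columns)
sample_data.loc[0, ['PRICEEACH', 'QUANTITYORDERED']] = [95, 14]

# Example prediction
sample_prediction = model.predict(sample_data)
print("Prediction:", "High Sales" if sample_prediction[0] == 1 else "Low Sales")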
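Typing a feature row by position is easy to get wrong, because the order of the dummy columns depends on the categories seen during training. One alternative is to describe the new order with its raw category labels, encode it, and align it to the training columns with `reindex`. The sketch below uses a hypothetical three-row training frame and a hypothetical `new_order`; none of the values come from the dataset.

```python
import pandas as pd

# Hypothetical training rows; values are made up for illustration.
train = pd.DataFrame({
    "PRICEEACH": [95.0, 120.0, 60.0],
    "QUANTITYORDERED": [30, 45, 20],
    "DEALSIZE": ["Medium", "Large", "Small"],
    "PRODUCTLINE": ["Classic Cars", "Motorcycles", "Vintage Cars"],
})
X_toy = pd.get_dummies(train, columns=["DEALSIZE", "PRODUCTLINE"], drop_first=True)

# A new order described with readable labels instead of positional 0/1 values.
new_order = pd.DataFrame([{
    "PRICEEACH": 95.0,
    "QUANTITYORDERED": 14,
    "DEALSIZE": "Small",
    "PRODUCTLINE": "Motorcycles",
}])

# Encode, then align to the training columns: dummy columns absent from the
# new row are filled with 0, and columns not seen in training are dropped.
sample_row = (
    pd.get_dummies(new_order, columns=["DEALSIZE", "PRODUCTLINE"])
    .reindex(columns=X_toy.columns, fill_value=0)
)
print(sample_row.T)
```

The aligned `sample_row` has the same column order as the training matrix, so it can be passed to `model.predict` without counting positions by hand.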

The logistic regression model performs very well, reaching about 97% accuracy on the held-out test set (20% of the rows) when predicting whether an order's sales fall above or below the dataset median. It is well balanced across both categories, identifying high-sales and low-sales orders about equally well.

Price, quantity, deal size, and product line are the inputs behind these predictions. A model like this can support better decisions in pricing, marketing, and inventory planning.

Overall, it is a reliable and useful tool for classifying sales as high or low.
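A single train/test split can be lucky or unlucky, so one way to check how stable the accuracy is would be k-fold cross-validation. The sketch below is not part of the original analysis: it runs on synthetic data, wraps the model in a pipeline with `StandardScaler` (an added step), and assumes for the toy data only that sales equal price times quantity. In the notebook, `X` and `y` would replace `toy_X` and `toy_y`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data; the real notebook would use X and y instead.
rng = np.random.default_rng(0)
n_rows = 300
toy_X = pd.DataFrame({
    "PRICEEACH": rng.uniform(30, 250, n_rows),
    "QUANTITYORDERED": rng.integers(10, 60, n_rows),
})
# Assumption for the toy data only: sales are price times quantity.
toy_sales = toy_X["PRICEEACH"] * toy_X["QUANTITYORDERED"]
toy_y = (toy_sales > toy_sales.median()).astype(int)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Accuracy on each of 5 folds.
cv_scores = cross_val_score(pipeline, toy_X, toy_y, cv=5, scoring="accuracy")

print("Fold accuracies:", np.round(cv_scores, 3))
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

If the fold accuracies are close to each other, the single-split result is less likely to be an artifact of one particular partition.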