# Final Project-Group 5 Data Science Class

## Step 0: Dataset Description

The dataset gives a detailed view of sales transactions, including customer demographics (age, gender, location), product details, and financial metrics like cost, revenue, and profit. It supports analysis of purchasing behavior across demographic groups and of trends over time in product quantity sold. Because it contains both numeric and categorical data, it enables multi-level analysis by product, customer, and region using various visualization and statistical techniques. As loaded in Step 1, the raw file has 15 columns (including an index column) and 34,867 rows.

## Step 1: Data Pipeline & Preparation

### 1.1 Data Acquisition

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from statsmodels.tsa.arima.model import ARIMA

from IPython.display import display
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

In [4]:
# Load the sales dataset (local Windows path; adjust for your machine)
df = pd.read_csv(r"C:\Users\hp probook\Desktop\git_test\Group 5-Dataset for Final Project.csv")
# Print the frame to inspect the first and last rows
print(df)

       index       Date    Year     Month  Customer Age Customer Gender        Country           State Product Category     Sub Category  Quantity  Unit Cost   Unit Price    Cost      Revenue
0          0  2/19/2016  2016.0  February          29.0               F  United States      Washington      Accessories  Tires and Tubes       1.0      80.00   109.000000    80.0   109.000000
1          1  2/20/2016  2016.0  February          29.0               F  United States      Washington         Clothing           Gloves       2.0      24.50    28.500000    49.0    57.000000
2          2  2/27/2016  2016.0  February          29.0               F  United States      Washington      Accessories  Tires and Tubes       3.0       3.67     5.000000    11.0    15.000000
3          3  3/12/2016  2016.0     March          29.0               F  United States      Washington      Accessories  Tires and Tubes       2.0      87.50   116.500000   175.0   233.000000
4          4  3/12/2016  2016.0     Marc

### 1.2 Check Data Types

In [5]:
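# df.info() prints its summary directly and returns None,
# so wrapping it in print() also emits the trailing "None" seen in the output
print(df.info())
# Count missing values per column
print(df.isnull().sum())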

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34867 entries, 0 to 34866
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             34867 non-null  int64  
 1   Date              34866 non-null  object 
 2   Year              34866 non-null  float64
 3   Month             34866 non-null  object 
 4   Customer Age      34716 non-null  float64
 5   Customer Gender   34766 non-null  object 
 6   Country           34746 non-null  object 
 7   State             34866 non-null  object 
 8   Product Category  34866 non-null  object 
 9   Sub Category      34866 non-null  object 
 10  Quantity          34666 non-null  float64
 11  Unit Cost         34686 non-null  float64
 12  Unit Price        34706 non-null  float64
 13  Cost              34866 non-null  float64
 14  Revenue           34567 non-null  float64
dtypes: float64(7), int64(1), object(7)
memory usage: 4.0+ MB
None
index                 0
D
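
The non-null counts above show that several columns (for example Customer Age, Quantity, and Revenue) have gaps. A minimal sketch of how those gaps could be summarized as counts and percentages follows. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first.

    Hypothetical helper for illustration; columns with no gaps are omitted.
    """
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the sales data (values are made up).
toy = pd.DataFrame({
    "Quantity": [1.0, np.nan, 3.0, 2.0],
    "Revenue": [109.0, 57.0, np.nan, np.nan],
    "State": ["Washington"] * 4,
})

summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(df)`. Seeing the percentage next to the count can help decide whether dropping or imputing a column's gaps is reasonable.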

### 1.3 Handle Duplicates & Outliers

In [6]:
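# Remove exact duplicate rows
df.drop_duplicates(inplace=True)

# Trim each numeric column to its 1st-99th percentile range.
# Note: rows with NaN in these columns are dropped as well, because NaN fails both comparisons.
# Each column's quantiles are computed on the frame already filtered by the previous columns.
for col in ['Quantity', 'Unit Cost', 'Unit Price', 'Revenue']:
    q_low = df[col].quantile(0.01)
    q_hi  = df[col].quantile(0.99)
    df = df[(df[col] >= q_low) & (df[col] <= q_hi)]

# df.info() returns None, so print() also emits a trailing "None"
print(df.info())
print(df.isnull().sum())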

<class 'pandas.core.frame.DataFrame'>
Index: 32263 entries, 0 to 34865
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   index             32263 non-null  int64  
 1   Date              32263 non-null  object 
 2   Year              32263 non-null  float64
 3   Month             32263 non-null  object 
 4   Customer Age      32130 non-null  float64
 5   Customer Gender   32171 non-null  object 
 6   Country           32152 non-null  object 
 7   State             32263 non-null  object 
 8   Product Category  32263 non-null  object 
 9   Sub Category      32263 non-null  object 
 10  Quantity          32263 non-null  float64
 11  Unit Cost         32263 non-null  float64
 12  Unit Price        32263 non-null  float64
 13  Cost              32263 non-null  float64
 14  Revenue           32263 non-null  float64
dtypes: float64(7), int64(1), object(7)
memory usage: 3.9+ MB
None
index                 0
Date  
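
After the filter, Quantity, Unit Cost, Unit Price, and Revenue have no missing values left, which suggests the percentile mask also removed the rows where those columns were NaN. The sketch below reproduces that effect on a toy column and shows one possible variant that keeps missing rows. The toy values and the names `strict` and `lenient` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one extreme value (made-up data).
toy = pd.DataFrame({"Quantity": [1.0, 2.0, np.nan, 3.0, 100.0]})

# quantile() ignores NaN when computing the cut-offs.
q_low = toy["Quantity"].quantile(0.01)
q_hi = toy["Quantity"].quantile(0.99)

# Same mask as the notebook cell: NaN fails both comparisons, so that row is dropped.
strict = toy[(toy["Quantity"] >= q_low) & (toy["Quantity"] <= q_hi)]

# Variant: keep rows inside the range OR rows where the value is missing.
in_range = toy["Quantity"].between(q_low, q_hi)
lenient = toy[in_range | toy["Quantity"].isna()]

print(len(strict), len(lenient))
```

Whether the strict or the lenient behavior is preferable depends on how missing values are meant to be handled later. The strict mask silently combines outlier trimming with row-wise deletion of missing data.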

### 1.4 Convert Columns

In [7]:
# Convert 'Date' to datetime safely
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Drop rows with missing dates (optional, depending on context)
df = df.dropna(subset=['Date'])

# Safely get dummies for existing categorical columns
categorical_cols = ['Customer Gender', 'Country', 'State', 'Product Category', 'Sub Category']
categorical_cols = [col for col in categorical_cols if col in df.columns]

df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Check types
df.dtypes

index                                    int64
Date                            datetime64[ns]
Year                                   float64
Month                                   object
Customer Age                           float64
                                     ...      
Sub Category_Shorts                       bool
Sub Category_Socks                        bool
Sub Category_Tires and Tubes              bool
Sub Category_Touring Bikes                bool
Sub Category_Vests                        bool
Length: 76, dtype: object
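
To make the two conversions in this step easier to follow, here is a small sketch on made-up rows. It assumes the dates follow the month/day/year pattern visible in the printed sample (the explicit `format` argument is that assumption), and it uses only category values that appear in the output above.

```python
import pandas as pd

# Toy rows (made up); the second date is deliberately unparseable.
toy = pd.DataFrame({
    "Date": ["2/19/2016", "not a date", "3/12/2016"],
    "Product Category": ["Accessories", "Clothing", "Clothing"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y", errors="coerce")

# Drop the rows whose date could not be parsed.
toy = toy.dropna(subset=["Date"])

# drop_first=True omits the first category (alphabetically, "Accessories"),
# which becomes the implicit baseline for the remaining indicator columns.
encoded = pd.get_dummies(toy, columns=["Product Category"], drop_first=True)

print(encoded.columns.tolist())
```

Dropping the first level avoids perfectly collinear indicator columns, which matters for linear models. The trade-off is that the baseline category no longer has its own column.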

## Step 2: Exploratory Data Analysis (EDA)

### 2.1 Descriptive Statistics