
# **Data Exploration and Preprocessing**

### Load the libraries

In [None]:
import pandas as pd
import numpy as np


### Loading the dataset

In [None]:
df = pd.read_excel('stock.xlsx')

### Preview First Rows of Dataset

In [None]:
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


### Shape of the Dataframe

In [None]:
# df.shape returns (number of rows, number of columns) as a tuple
print("Shape of dataset:", df.shape)

Shape of dataset: (541909, 8)


### Basic Array Info from Dataset

In [None]:
# Convert the pandas DataFrame to a NumPy array
# (columns of different types are combined into a single object-dtype array)
data_array = df.to_numpy()

# Number of rows and columns
print("\nArray shape:", data_array.shape)

# Total number of elements in the array (rows * columns)
print("Array size:", data_array.size)

# Number of dimensions of the array
print("Array dimensions (ndim):", data_array.ndim)


Array shape: (541909, 8)
Array size: 4335272
Array dimensions (ndim): 2
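
The reported size is the product of the shape entries: 541909 rows times 8 columns gives 4335272 elements. A minimal sketch of the same relationship on a small made-up frame (the name `toy` and its values are illustrative only and are not taken from `stock.xlsx`):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({"Quantity": [6, 6, 8], "UnitPrice": [2.55, 3.39, 2.75]})
toy_array = toy.to_numpy()

n_rows, n_cols = toy_array.shape

# size is always the product of the shape entries
size_matches = toy_array.size == n_rows * n_cols

print("shape:", toy_array.shape)
print("size:", toy_array.size, "| equals rows * cols:", size_matches)
print("ndim:", toy_array.ndim)
```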


### Checking missing values

In [None]:
print("\nNull counts per column:")
print(df.isnull().sum())


Null counts per column:
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
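
Raw counts are easier to judge next to the share of rows they represent. In the output above, the 135080 missing `CustomerID` values are roughly 24.9% of the 541909 rows, while the 1454 missing `Description` values are about 0.27%. The sketch below shows one way to compute both figures together. It uses a small made-up frame (`toy`, with invented values), so it is an illustration rather than the notebook's own cell:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# Count of missing values per column
null_counts = toy.isnull().sum()

# Share of missing values per column, as a percentage of all rows
null_share = toy.isnull().mean() * 100

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share.round(2)})
print(summary)
```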


### Filling missing values

In [None]:
# CustomerID has too many missing values (135,080) to drop the rows,
# so fill them with the placeholder 'Unknown'.
# Note: this puts strings into a float column, so its dtype becomes object.
df['CustomerID'] = df['CustomerID'].fillna('Unknown')

# Fill the missing values in Description with 'No Description'
df['Description'] = df['Description'].fillna('No Description')


print(df.isnull().sum())

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
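
One side effect of this step is worth checking: filling a `float64` column with a string upcasts it to `object`, so the column then holds both floats and strings. The sketch below demonstrates this on a made-up Series (`ids`, with invented values). It also shows one possible alternative, which is an assumption of this example and not what the notebook does: convert the IDs to text first, so that every value, including the placeholder, is a string.

```python
import numpy as np
import pandas as pd

# Hypothetical customer IDs; values are invented for illustration
ids = pd.Series([17850.0, np.nan, 13047.0], name="CustomerID")

# Same approach as the cell above: fill NaN with a string placeholder
filled = ids.fillna("Unknown")
print(filled.dtype)                      # dtype after filling
print(filled.map(type).value_counts())   # floats and strings side by side

# Possible alternative (assumption): go through the nullable Int64 dtype
# to drop the trailing ".0", convert to text, then fill the placeholder
as_text = ids.astype("Int64").astype("string").fillna("Unknown")
print(as_text.tolist())
```

If a later `df.dtypes` listing still reports `float64` for `CustomerID`, the most likely explanation is that the fill cell had not been run at that point.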


### Data Types of Each Column

In [None]:
print("\nData types per column:")
print(df.dtypes)


Data types per column:
InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object


### Check Columns for Different Data Types

In [None]:
# For each column, map every value to its Python type
# and count how often each type occurs
for col in df.columns:
    types = df[col].map(type).value_counts()

    # Report any column that contains more than one Python type
    if len(types) > 1:
        print(f"\n Mixed data types found in column '{col}':")
        print(types)


 Mixed data types found in column 'InvoiceNo':
InvoiceNo
<class 'int'>    532618
<class 'str'>      9291
Name: count, dtype: int64

 Mixed data types found in column 'StockCode':
StockCode
<class 'int'>    487036
<class 'str'>     54873
Name: count, dtype: int64

 Mixed data types found in column 'Description':
Description
<class 'str'>    541908
<class 'int'>         1
Name: count, dtype: int64


### Analysis of the output

InvoiceNo and StockCode hold mostly integers (532,618 and 487,036 values respectively), but 9,291 InvoiceNo values and 54,873 StockCode values are strings. Each column needs a single type. Because identifiers such as the stock code 85123A contain letters, converting every value to a string is the safe way to standardize them.

The Description column is almost entirely text: 541,908 strings and a single integer value that needs to be corrected.
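
Before converting, it can help to look at the offending rows directly. The following is a minimal sketch of how non-string values could be located. It runs on a made-up frame (`toy`, with invented values), and the mask name `not_text` is introduced here for illustration only:

```python
import pandas as pd

# Hypothetical toy column with one stray integer among the strings
toy = pd.DataFrame(
    {"Description": ["WHITE METAL LANTERN", 20713, "No Description"]}
)

# True for every value that is not a Python str
not_text = ~toy["Description"].map(lambda v: isinstance(v, str))

print("Non-string values:", int(not_text.sum()))
print(toy[not_text])
```

Applied to the real DataFrame, the same mask on `df['Description']` would return the single row counted under `<class 'int'>` above.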

### Converting the values in the InvoiceNo, StockCode, and Description columns to strings

In [None]:
# Converting the InvoiceNo and StockCode columns to strings
df['InvoiceNo'] = df['InvoiceNo'].astype(str)
df['StockCode'] = df['StockCode'].astype(str)

# Converting Description to string
df['Description'] = df['Description'].astype(str)
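
After the conversion, the mixed-type check can be repeated to confirm that each column now holds a single Python type. A minimal sketch on a made-up frame (`toy` and all of its values, including the invoice number `C536379`, are invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame with mixed int/str values, as in the real columns
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536366],
    "Description": ["LANTERN", "HOLDER", 20713],
})

# Same conversion as the cell above
for col in ["InvoiceNo", "Description"]:
    toy[col] = toy[col].astype(str)

# Columns that still contain more than one Python type (expected: none)
mixed = [c for c in toy.columns if toy[c].map(type).nunique() > 1]

print("Columns with mixed types:", mixed)
print(toy["InvoiceNo"].tolist())
```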
