## raw_data.csv

• Load and preview raw_data.csv.
• Display a .head() and .info().
• Add observations (e.g., missing values, suspicious columns, duplicates).
• Save raw copies to data/ directory.

In [1]:
# Import the required library
import pandas as pd

# Load 'raw_data.csv' into a DataFrame
raw_data = pd.read_csv("raw_data.csv")

print(f"Extracted {len(raw_data)} rows")

# Display the number of rows and columns in 'raw_data.csv'
num_rows, num_columns = raw_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

Extracted 100 rows
Number of rows: 100
Number of columns: 7
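By default `pd.read_csv` leaves date-like text columns as plain strings. As a minimal sketch, assuming the file has an `order_date` column in ISO format, `parse_dates` can convert it while loading. The inline rows are illustrative stand-ins so the cell runs without the real file.

```python
import io

import pandas as pd

# Illustrative stand-in rows with the same seven columns as the file.
csv_text = """order_id,customer_name,product,quantity,unit_price,order_date,region
1,Diana,Tablet,,500.0,2024-01-20,South
2,Eve,Laptop,,,2024-04-29,North
3,Charlie,Laptop,2.0,250.0,2024-01-08,
"""

# parse_dates converts the listed columns to datetime while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(sample.dtypes)
```

Empty fields are read as NaN, which is why `quantity` comes back as a float column even though its values look like whole numbers.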


In [5]:
# Display '.head()' and '.info()'
print("Below are the first 5 observations in the dataset: \n")
print(raw_data.head())

print("\n\n")

print("Summary of the data types and non-null counts in the dataset: \n")
# info() prints its summary directly and returns None, so it is not wrapped in print()
raw_data.info()


Below are the first 5 observations in the dataset: 

   order_id customer_name product  quantity  unit_price  order_date region
0         1         Diana  Tablet       NaN       500.0  2024-01-20  South
1         2           Eve  Laptop       NaN         NaN  2024-04-29  North
2         3       Charlie  Laptop       2.0       250.0  2024-01-08    NaN
3         4           Eve  Laptop       2.0       750.0  2024-01-07   West
4         5           Eve  Tablet       3.0         NaN  2024-03-07  South



Summary of the data types and non-null counts in the dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       100 non-null    int64  
 1   customer_name  99 non-null     object 
 2   product        100 non-null    object 
 3   quantity       74 non-null     float64
 4   unit_price     65 non-null     float64
 5   order_date    

In [None]:
# Missing values per column
missing_values_raw = raw_data.isnull().sum()
print(f"Number of missing values in each column:\n{missing_values_raw}\n")

# Suspicious columns: fewer than two distinct non-null values (constant or entirely empty)
suspicious_columns_raw = raw_data.columns[raw_data.nunique() < 2]
print(f"Suspicious columns:\n{suspicious_columns_raw}\n")

# Fully duplicated rows (each repeat of an earlier row counts once)
duplicates_raw = raw_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_raw}\n")

Number of missing values in each column:
order_id          0
customer_name     1
product           0
quantity         26
unit_price       35
order_date        1
region           25
dtype: int64

Suspicious columns:
Index([], dtype='object')

Number of duplicate rows: 1
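The duplicate count alone does not show which rows repeat. As a rough sketch, the rows themselves can be pulled out with `duplicated(keep=False)`, and the same check can be limited to a key column. The `sample` frame is a made-up stand-in for `raw_data`, and treating `order_id` as the key is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for raw_data; the values are made up for illustration.
sample = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3],
        "customer_name": ["Diana", "Eve", "Eve", "Charlie"],
        "product": ["Tablet", "Laptop", "Laptop", "Laptop"],
    }
)

# keep=False marks every member of a duplicate group, not just the repeats,
# so the matching rows can be inspected side by side.
duplicate_rows = sample[sample.duplicated(keep=False)]
print(duplicate_rows)

# Rows may also repeat on the key alone while differing in other columns.
# Assumption: order_id is meant to be unique.
id_mask = sample.duplicated(subset="order_id", keep=False)
duplicate_ids = sample.loc[id_mask, "order_id"].unique()
print(f"order_id values that appear more than once: {duplicate_ids.tolist()}")
```

Looking at the rows before dropping anything helps decide whether the repeat is a true duplicate or two distinct orders that happen to share values.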



In [None]:
# Saving raw copy
raw_data.to_csv('1_data/raw_data_copy.csv', index=False)
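`to_csv` does not create missing folders, so the save fails if the target directory is absent. A minimal sketch of creating it first with `pathlib` follows. The `sample` frame and the temporary base folder are stand-ins so the cell can run anywhere; in the notebook the base would be the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for raw_data.
sample = pd.DataFrame({"order_id": [1, 2], "product": ["Tablet", "Laptop"]})

# A temporary base folder keeps this sketch from writing into the project.
base_dir = Path(tempfile.mkdtemp())
output_dir = base_dir / "1_data"

# Create the folder (and any parents) if it does not exist yet.
output_dir.mkdir(parents=True, exist_ok=True)

output_path = output_dir / "raw_data_copy.csv"
sample.to_csv(output_path, index=False)

# Read the copy back to confirm the row count survived the round trip.
round_trip = pd.read_csv(output_path)
print(f"Saved {len(round_trip)} rows to {output_path.name}")
```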

## incremental_data.csv

• Load and preview incremental_data.csv.
• Display a .head() and .info().
• Add observations (e.g., missing values, suspicious columns, duplicates).
• Save raw copies to data/ directory.

In [2]:
# Import the required library
import pandas as pd

# Load 'incremental_data.csv' into a DataFrame
incremental_data = pd.read_csv("incremental_data.csv")

print(f"Extracted {len(incremental_data)} rows")

# Display the number of rows and columns in 'incremental_data.csv'
num_rows, num_columns = incremental_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

Extracted 10 rows
Number of rows: 10
Number of columns: 7


In [6]:
# Display '.head()' and '.info()'
print("Below are the first 5 observations in the dataset: \n")
print(incremental_data.head())

print("\n\n")

print("Summary of the data types and non-null counts in the dataset: \n")
# info() prints its summary directly and returns None, so it is not wrapped in print()
incremental_data.info()

Below are the first 5 observations in the dataset: 

   order_id customer_name product  quantity  unit_price  order_date   region
0       101         Alice  Laptop       NaN       900.0  2024-05-09  Central
1       102           NaN  Laptop       1.0       300.0  2024-05-07  Central
2       103           NaN  Laptop       1.0       600.0  2024-05-04  Central
3       104           NaN  Tablet       NaN       300.0  2024-05-26  Central
4       105         Heidi  Tablet       2.0       600.0  2024-05-21    North



Summary of the data types and non-null counts in the dataset: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   order_id       10 non-null     int64  
 1   customer_name  4 non-null      object 
 2   product        10 non-null     object 
 3   quantity       6 non-null      float64
 4   unit_price     10 non-null     float64
 5   orde

In [14]:
# Missing values per column
missing_values_incremental = incremental_data.isnull().sum()
print(f"Number of missing values in each column:\n{missing_values_incremental}\n")

# Suspicious columns: fewer than two distinct non-null values (constant or entirely empty)
suspicious_columns_incremental = incremental_data.columns[incremental_data.nunique() < 2]
print(f"Suspicious columns:\n{suspicious_columns_incremental}\n")

# Fully duplicated rows (each repeat of an earlier row counts once)
duplicates_incremental = incremental_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates_incremental}\n")

Number of missing values in each column:
order_id         0
customer_name    6
product          0
quantity         4
unit_price       0
order_date       0
region           2
dtype: int64

Suspicious columns:
Index([], dtype='object')

Number of duplicate rows: 0
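With only 10 rows, raw counts are easier to judge as a share of the file. The sketch below puts counts and percentages side by side. The `sample` frame reuses a few values from the first four rows of the preview above and is only a stand-in for `incremental_data`.

```python
import numpy as np
import pandas as pd

# Stand-in built from a few values in the preview; not the full file.
sample = pd.DataFrame(
    {
        "order_id": [101, 102, 103, 104],
        "customer_name": ["Alice", np.nan, np.nan, np.nan],
        "quantity": [np.nan, 1.0, 1.0, np.nan],
    }
)

# isna().mean() gives the fraction of missing values per column.
missing_summary = pd.DataFrame(
    {
        "missing": sample.isna().sum(),
        "missing_pct": (sample.isna().mean() * 100).round(1),
    }
).sort_values("missing_pct", ascending=False)

print(missing_summary)
```

Sorting by percentage puts the sparsest columns at the top, which is where imputation or exclusion decisions usually start.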



In [16]:
# Saving raw copy
incremental_data.to_csv('1_data/incremental_data_copy.csv', index=False)
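Before the two extracts are combined, it can help to confirm that they share a schema and to see how their keys and categories relate. The following is a hedged sketch on made-up stand-in frames; `raw_sample` and `incremental_sample` are hypothetical names, and treating `order_id` as the key is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for raw_data and incremental_data.
raw_sample = pd.DataFrame(
    {"order_id": [1, 2, 3], "region": ["South", "North", "West"]}
)
incremental_sample = pd.DataFrame(
    {"order_id": [101, 102], "region": ["Central", "North"]}
)

# Same column names in the same order?
same_columns = list(raw_sample.columns) == list(incremental_sample.columns)

# Keys present in both extracts (assumes order_id identifies an order).
overlapping_ids = set(raw_sample["order_id"]) & set(incremental_sample["order_id"])

# Category values that appear only in the incremental extract.
new_regions = set(incremental_sample["region"].dropna()) - set(
    raw_sample["region"].dropna()
)

print(f"Same columns: {same_columns}")
print(f"Overlapping order_id values: {sorted(overlapping_ids)}")
print(f"Regions only in the incremental data: {sorted(new_regions)}")
```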