# Data Wrangling and Subsetting
 Clean, manipulate, and subset data using Python libraries like Pandas and NumPy.

## What is data wrangling?

- The process of cleaning, transforming, and organizing raw data into a usable format.

- Why it matters: Real-world data is often messy and unstructured. Wrangling ensures data is ready for analysis.

- Tools: Pandas (for data manipulation) and NumPy (for numerical operations).

## 01. Import Libraries

In [5]:
import pandas as pd
import numpy as np
import os

## 02. Load data 

In [7]:
folderpath = r'C:\Users\Bamidele\Desktop\Training_content\Data Analytics Training - Techbams Solutions\Python\BasketAnalysis_python_project'

In [9]:
folderpath

'C:\\Users\\Bamidele\\Desktop\\Training_content\\Data Analytics Training - Techbams Solutions\\Python\\BasketAnalysis_python_project'

In [11]:
df = pd.read_csv(os.path.join(folderpath, '02 Data', 'Raw Data', 'sales_data_wrangling.csv'), index_col = False)

In [13]:
print(df)

   OrderID Product  Quantity  Price Customer Region
0        1   Apple      10.0    0.5     John  North
1        2  Banana       5.0    0.3     Anna  South
2        3   Apple       7.0    0.5     John  North
3        4  Orange       3.0    0.8     Anna  South
4        5  Banana       6.0    0.3     John  North
5        6   Apple       NaN    0.5      NaN  North
6        7  Orange       8.0    0.8     Anna  South
7        8  Banana       2.0    0.3     John  North
8        9   Apple       NaN    0.5     Anna  South
9       10  Orange       5.0    0.8     John  North
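
For readers who do not have the CSV file, the frame printed above can be rebuilt in memory. This is a sketch that assumes the file holds exactly the ten rows shown; the name `df_demo` is hypothetical and is used so the loaded `df` is not overwritten.

```python
import numpy as np
import pandas as pd

# Rebuild the ten rows printed above; np.nan marks the missing entries.
df_demo = pd.DataFrame({
    "OrderID": range(1, 11),
    "Product": ["Apple", "Banana", "Apple", "Orange", "Banana",
                "Apple", "Orange", "Banana", "Apple", "Orange"],
    "Quantity": [10, 5, 7, 3, 6, np.nan, 8, 2, np.nan, 5],
    "Price": [0.5, 0.3, 0.5, 0.8, 0.3, 0.5, 0.8, 0.3, 0.5, 0.8],
    "Customer": ["John", "Anna", "John", "Anna", "John",
                 np.nan, "Anna", "John", "Anna", "John"],
    "Region": ["North", "South", "North", "South", "North",
               "North", "South", "North", "South", "North"],
})

print(df_demo.shape)  # (rows, columns)
```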


### Inspect the data

In [41]:
print(df.head())  # First 5 rows
print(df.info())  # Data types and missing values
print(df.describe())  # Summary statistics
print(df.columns)  # Column names
print(df.shape)  # Number of rows and columns

   OrderID Product  Quantity  Price Customer Region
0        1   Apple      10.0    0.5     John  North
1        2  Banana       5.0    0.3     Anna  South
2        3   Apple       7.0    0.5     John  North
3        4  Orange       3.0    0.8     Anna  South
4        5  Banana       6.0    0.3     John  North
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   OrderID   10 non-null     int64  
 1   Product   10 non-null     object 
 2   Quantity  8 non-null      float64
 3   Price     10 non-null     float64
 4   Customer  9 non-null      object 
 5   Region    10 non-null     object 
dtypes: float64(2), int64(1), object(3)
memory usage: 612.0+ bytes
None
        OrderID  Quantity      Price
count  10.00000   8.00000  10.000000
mean    5.50000   5.75000   0.530000
std     3.02765   2.60494   0.205751
min     1.00000   2.00000   0.300000
25%     3.25000   

## 03. Data Cleaning (Wrangling)


In [43]:
### Handling missing values
df.isnull().sum()  # Check missing values

OrderID     0
Product     0
Quantity    2
Price       0
Customer    1
Region      0
dtype: int64

- `Quantity` has **2** null values and `Customer` has **1**.
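
The counts show how many values are missing but not which rows hold them. A boolean mask built from `isnull()` can pull those rows out. This is a minimal sketch on a hypothetical four-row frame `demo` that mirrors some of the columns above:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sales data, for illustration only.
demo = pd.DataFrame({
    "OrderID": [5, 6, 7, 9],
    "Quantity": [6.0, np.nan, 8.0, np.nan],
    "Customer": ["John", np.nan, "Anna", "Anna"],
})

# Count missing values per column.
missing_per_column = demo.isnull().sum()

# True for every row that has at least one missing value.
has_missing = demo.isnull().any(axis=1)
rows_with_missing = demo[has_missing]

print(missing_per_column)
print(rows_with_missing)
```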

In [None]:
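# Two alternative strategies: run only one of them (this cell was left unexecuted).
# df.dropna(inplace=True)             # Option 1: drop rows containing any missing value
# df.fillna("Unknown", inplace=True)  # Option 2: replace every missing value with "Unknown"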

In [None]:
# Handle missing values in the 'Quantity' column
df['Quantity'] = df['Quantity'].fillna(0)  # Assign back instead of chained inplace=True
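
The calls above are alternatives rather than steps to run in sequence. The sketch below compares them on a hypothetical miniature frame `demo`, using the non-`inplace` forms so that each result can be inspected on its own. The fill values (`0` and `"Unknown"`) follow the cells above; applying them per column through a dictionary is an illustrative choice, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sales data, for illustration only.
demo = pd.DataFrame({
    "OrderID": [5, 6, 7, 9],
    "Quantity": [6.0, np.nan, 8.0, np.nan],
    "Customer": ["John", np.nan, "Anna", "Anna"],
})

# Option 1: drop every row that has any missing value.
dropped = demo.dropna()

# Option 2: drop rows only when Customer is missing.
dropped_customer = demo.dropna(subset=["Customer"])

# Option 3: fill each column with a value that suits its type.
filled = demo.fillna({"Quantity": 0, "Customer": "Unknown"})

print(len(dropped), len(dropped_customer), len(filled))
```

Because none of these calls modifies `demo`, the original frame stays available for comparison.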

In [19]:
df['Region'].dtype

dtype('O')

In [27]:
df

   OrderID Product  Quantity  Price Customer Region
0        1   Apple      10.0    0.5     John  North
1        2  Banana       5.0    0.3     Anna  South
2        3   Apple       7.0    0.5     John  North
3        4  Orange       3.0    0.8     Anna  South
4        5  Banana       6.0    0.3     John  North
5        6   Apple       NaN    0.5      NaN  North
6        7  Orange       8.0    0.8     Anna  South
7        8  Banana       2.0    0.3     John  North
8        9   Apple       NaN    0.5     Anna  South
9       10  Orange       5.0    0.8     John  North


In [33]:
df.iloc[4:7]

   OrderID Product  Quantity  Price Customer Region
4        5  Banana       6.0    0.3     John  North
5        6   Apple       NaN    0.5      NaN  North
6        7  Orange       8.0    0.8     Anna  South


In [43]:
df.iloc[0]

OrderID         1
Product     Apple
Quantity     10.0
Price         0.5
Customer     John
Region      North
Name: 0, dtype: object

In [45]:
df.iloc[1:]

   OrderID Product  Quantity  Price Customer Region
1        2  Banana       5.0    0.3     Anna  South
2        3   Apple       7.0    0.5     John  North
3        4  Orange       3.0    0.8     Anna  South
4        5  Banana       6.0    0.3     John  North
5        6   Apple       NaN    0.5      NaN  North
6        7  Orange       8.0    0.8     Anna  South
7        8  Banana       2.0    0.3     John  North
8        9   Apple       NaN    0.5     Anna  South
9       10  Orange       5.0    0.8     John  North
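
`iloc` selects by integer position and, like ordinary Python slicing, excludes the end position, which is why `df.iloc[4:7]` returns rows 4, 5, and 6. The label-based `loc` includes both endpoints. A minimal sketch on a hypothetical frame `demo`, whose index labels deliberately differ from the row positions:

```python
import pandas as pd

# Hypothetical frame: labels 10..40 make position and label easy to tell apart.
demo = pd.DataFrame(
    {"Product": ["Apple", "Banana", "Orange", "Apple"]},
    index=[10, 20, 30, 40],
)

by_position = demo.iloc[1:3]  # positions 1 and 2 (end excluded)
by_label = demo.loc[20:40]    # labels 20 through 40 (end included)
first_row = demo.iloc[0]      # a single row comes back as a Series

print(by_position.index.tolist())
print(by_label.index.tolist())
print(type(first_row).__name__)
```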


In [53]:
bams = df[df['Region'] == 'South']  # Keep only the rows where Region is 'South'

In [55]:
bams

Unnamed: 0,OrderID,Product,Quantity,Price,Customer,Region
1,2,Banana,5.0,0.3,Anna,South
3,4,Orange,3.0,0.8,Anna,South
6,7,Orange,8.0,0.8,Anna,South
8,9,Apple,,0.5,Anna,South
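
The comparison `df['Region'] == 'South'` produces a boolean Series, and indexing with it keeps the rows where it is `True`. Conditions can be combined with `&` and `|` (each wrapped in parentheses) or expressed with `isin`. A minimal sketch, using a hypothetical frame `demo` with a few rows copied from the data above:

```python
import pandas as pd

# Hypothetical miniature: four rows taken from the sales data.
demo = pd.DataFrame({
    "Product": ["Apple", "Banana", "Orange", "Orange"],
    "Quantity": [10.0, 5.0, 3.0, 8.0],
    "Region": ["North", "South", "South", "South"],
})

# Single condition, as in the cell above.
south = demo[demo["Region"] == "South"]

# Two conditions: each one needs its own parentheses.
south_large = demo[(demo["Region"] == "South") & (demo["Quantity"] > 4)]

# Membership test against a list of values.
apples_oranges = demo[demo["Product"].isin(["Apple", "Orange"])]

# Filter rows and pick columns in one step with loc.
south_products = demo.loc[demo["Region"] == "South", ["Product", "Quantity"]]

print(len(south), len(south_large), len(apples_oranges))
```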
