### Data Cleaning Fundamentals: Sales Dataset
### Summary
This project performs a structured data-cleaning workflow on a raw sales dataset. The process includes loading the data, inspecting its structure, correcting data types, standardizing text fields, and removing invalid or missing entries. Additional checks for duplicates and formatting issues ensure the dataset is consistent and ready for analysis or modeling.
The final output is an exported, fully cleaned CSV file that can be used in subsequent data-analysis stages.

**Goal:**
The goal of this notebook is to perform a structured data-cleaning workflow on a raw sales dataset to prepare it for reliable analysis.  




In [1]:
import pandas as pd

# Load the raw sales data; the file is read with ISO-8859-1 encoding
df = pd.read_csv("datasets/sales_data.csv", encoding="ISO-8859-1")


#### What I Did

- I imported the pandas library and loaded the sales dataset, sales_data.csv, using the correct file encoding (ISO-8859-1).
This step loads the data into a DataFrame so that cleaning can begin.
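To illustrate why the encoding argument matters, here is a minimal sketch using an in-memory sample rather than the real file. The column names and values are hypothetical; the sketch only assumes that the file contains characters outside plain ASCII.

```python
import io

import pandas as pd

# Hypothetical one-row sample; "é" and "è" are single bytes in ISO-8859-1.
raw_bytes = "CUSTOMERNAME,SALES\nCafé Lumière,2871.0\n".encode("ISO-8859-1")

# Reading those bytes with the default UTF-8 decoding fails on the accented bytes.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding reads the accented characters correctly.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(utf8_ok, sample.loc[0, "CUSTOMERNAME"])
```

If a file's encoding is unknown, trying UTF-8 first and falling back to ISO-8859-1 is a common approach, though it is worth spot-checking a few text values afterwards.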

In [None]:
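# Print the preview explicitly: a cell only auto-displays its last expression,
# and df.info() below returns None
print(df.head())

# Summarize column names, data types, and non-null counts
df.info()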

#### What I Did
- I displayed the first 5 records from the dataset using df.head().
- This helps verify that the dataset was loaded correctly.
- I also called df.info() to list each column's data type and non-null count.

#### What the Output Shows
- The output shows the first few rows of the sales dataset, including columns such as order details, customer contact names, and sales values, followed by the df.info() column summary.

#### Insights
- Viewing the initial records gives a quick understanding of the dataset structure and any obvious data-quality issues.
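A few other quick inspection calls can complement df.head() and df.info(). The sketch below runs on a small hypothetical sample; ORDERNUMBER and the values shown are assumptions, not taken from the real file.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the loaded dataset.
sample = pd.DataFrame({
    "ORDERNUMBER": [10107, 10121, 10134],
    "ORDERDATE": ["2003-02-24", "2003-05-07", None],
    "CONTACTLASTNAME": ["Yu", " henriot", "Da Cunha "],
})

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # ORDERDATE is not yet a datetime column here
print(sample.isna().sum())  # missing values per column
```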

In [None]:
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'], errors='coerce')

#### What I Did

- I converted the ORDERDATE column from string format into a proper datetime format using pd.to_datetime().
- I used errors='coerce' so that values that cannot be parsed become missing values (NaT) instead of raising an error.
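The effect of errors='coerce' can be seen on a small sample. The date strings below are hypothetical; the actual ORDERDATE format in sales_data.csv may differ.

```python
import pandas as pd

# Hypothetical values: two valid dates and one entry that cannot be parsed.
dates = pd.Series(["2003-02-24", "2003-05-07", "not a date"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

# Count how many entries were coerced to NaT.
n_invalid = int(parsed.isna().sum())
print(parsed.dtype, n_invalid)
```

Comparing the number of missing values before and after the conversion is a simple way to see how many dates were silently coerced.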

In [None]:
# Strip surrounding whitespace and apply title case, writing back to the same columns
df['CONTACTFIRSTNAME'] = df['CONTACTFIRSTNAME'].str.strip().str.title()
df['CONTACTLASTNAME'] = df['CONTACTLASTNAME'].str.strip().str.title()

#### What I Did

- I cleaned the customer contact name columns by removing extra spaces and applying title-case formatting.
- This ensures consistency in first and last name fields.

#### Insights
- Cleaning text columns improves data quality and prevents issues during grouping or merging operations.
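As a small illustration of how inconsistent text affects grouping, the sketch below applies the same strip-and-title-case chain to hypothetical name values.

```python
import pandas as pd

# Hypothetical values: three spellings of the same name plus one other name.
names = pd.Series(["Yu", " yu", "YU ", "da cunha"])

# Same chain as in the cleaning step: trim whitespace, then title-case.
cleaned = names.str.strip().str.title()

# Before cleaning each variant counts as its own group; afterwards they merge.
print(names.nunique(), cleaned.nunique())
print(cleaned.tolist())
```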


In [None]:
# df['CONTACTFIRSTNAME'].sample(10)
# mask=df['CONTACTLASTNAME'].str.contains('')
# mask.sum()
# df['CONTACTLASTNAME'].str.contains('  ').sum()
# Flag last names that start or end with a space
mask = df['CONTACTLASTNAME'].str.startswith(' ') | df['CONTACTLASTNAME'].str.endswith(' ')
mask.sum()



#### What I Did
- I created a boolean mask to detect entries in CONTACTLASTNAME that have unwanted leading or trailing spaces.
Then I counted how many such cases exist.

#### What the Output Shows
- The output shows an integer representing how many last names had extra spaces.

#### Insights
- Because this check runs after the name-cleaning step, a count of 0 confirms that the stray spaces are gone; a nonzero count means some last names still need cleaning.
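An alternative way to write this check, sketched on hypothetical values, is to compare each entry with its stripped version. This also catches tabs and other whitespace, not only spaces.

```python
import pandas as pd

# Hypothetical last names; two of them carry stray whitespace.
last_names = pd.Series(["Yu", " Henriot", "Da Cunha ", "Young"])

# An entry has stray whitespace if stripping it changes the value.
has_padding = last_names != last_names.str.strip()
n_before = int(has_padding.sum())

# After stripping, the same check should find nothing.
stripped = last_names.str.strip()
n_after = int((stripped != stripped.str.strip()).sum())
print(n_before, n_after)
```

Missing values would need separate handling with this approach, so it is safest to run it on a column without nulls.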

In [None]:
df.info()

In [None]:
df.dropna(inplace=True)

In [None]:
df.isna().sum()

In [None]:
df.to_csv("sales_data_cleaned.csv", index=False)


#### What I Did
- I ran df.info() again to confirm that the earlier cleaning steps updated the data types and to see how many non-null values each column holds.
- I dropped all rows containing missing values using df.dropna(inplace=True).
- With no arguments, dropna() removes a row if any column is missing, so no null values remain in any column, not only the important ones.
- I checked each column for missing values using df.isna().sum() to confirm that none remain.
- I exported the cleaned dataset into a new CSV file named sales_data_cleaned.csv, using index=False so the row index is not written as an extra column.
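Because dropna() with no arguments checks every column, it can remove more rows than intended when an optional column is often empty. The sketch below contrasts it with the subset parameter; ADDRESSLINE2 is a hypothetical stand-in for such an optional column.

```python
import pandas as pd

# Hypothetical sample; ADDRESSLINE2 stands in for an optional, often-empty column.
sample = pd.DataFrame({
    "ORDERDATE": pd.to_datetime(["2003-02-24", None, "2003-07-01"]),
    "CONTACTLASTNAME": ["Yu", "Henriot", "Da Cunha"],
    "ADDRESSLINE2": [None, "Suite 101", None],
})

# No arguments: a row is dropped if ANY column is missing.
dropped_any = sample.dropna()

# subset=: only the listed columns are required to be present.
dropped_subset = sample.dropna(subset=["ORDERDATE", "CONTACTLASTNAME"])

print(len(sample), len(dropped_any), len(dropped_subset))
```

Comparing the row count before and after the drop shows how much data the step removes, which helps decide whether a subset is the better choice.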

#### What the Output Shows
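- df.info() prints the column summary, and df.isna().sum() lists the missing-value count per column, which should be 0 everywhere after dropna().
- df.to_csv() produces no visible output; the cleaned file is saved to disk.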

#### Insights
- This ensures a reproducible cleaned dataset that can be used in future analysis or modeling.
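The Summary mentions a duplicate check, and the export can also be verified by reading it back. The following is a minimal sketch of both on a hypothetical sample, writing to an in-memory buffer instead of sales_data_cleaned.csv; ORDERNUMBER and the values are assumptions.

```python
import io

import pandas as pd

# Hypothetical cleaned data containing one exact duplicate row.
cleaned = pd.DataFrame({
    "ORDERNUMBER": [10107, 10107, 10121],
    "ORDERDATE": pd.to_datetime(["2003-02-24", "2003-02-24", "2003-05-07"]),
    "CONTACTLASTNAME": ["Yu", "Yu", "Henriot"],
})

# Count fully identical rows, then keep only the first occurrence of each.
n_duplicates = int(cleaned.duplicated().sum())
deduped = cleaned.drop_duplicates()

# Round trip: export without the index, then reload and re-parse the dates.
buffer = io.StringIO()
deduped.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["ORDERDATE"])

print(n_duplicates, reloaded.shape, int(reloaded.isna().sum().sum()))
```

CSV files do not store data types, so date columns come back as text unless they are parsed again on load, as parse_dates does here.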