---

## Mini-Project: Load, Clean, and Perform Simple Analysis on a CSV File using Pandas

### Objective
The goal of this mini-project is to demonstrate how to:
1. Load a CSV file into a Pandas DataFrame.
2. Clean the data by handling missing values, duplicates, and other common issues.
3. Perform basic analysis and summarize key insights.

---

### Steps

#### **Step 1: Load the CSV File**

- **Using `pd.read_csv()`**  
  Begin by loading a dataset from a CSV file. `pd.read_csv()` accepts a local file path or a URL, so the file can be local or online.
  
  Example:
  ```python
  import pandas as pd
  df = pd.read_csv('data.csv')  # Load the CSV file
  ```

- **Inspect the Data**  
  Use basic methods like `head()`, `info()`, and `describe()` to get an initial look at the data.
  
  Example:
  ```python
  print(df.head())         # View the first 5 rows
  df.info()                # Overview of data types and non-null counts (prints directly, returns None)
  print(df.describe())     # Statistical summary of numeric columns
  ```
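
The loading and inspection steps above can be tried without any file on disk. The sketch below is a minimal example that assumes a small, made-up CSV held in a string (the column names `name`, `age`, and `score` are hypothetical) and reads it through `io.StringIO`, since `pd.read_csv()` accepts file-like objects as well as paths and URLs.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real file; "NA" marks a missing score.
csv_text = """name,age,score
Ana,20,88.5
Ben,22,NA
Cara,21,92.0
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # data type inferred for each column
print(df.isna().sum())  # number of missing values per column
```

Checking `df.isna().sum()` right after loading gives a quick picture of how much cleaning the next step is likely to need.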

---

#### **Step 2: Data Cleaning**

- **Handle Missing Data**  
  Identify missing values in the dataset and decide how to handle them. Options include dropping the affected rows or columns (`dropna()`) or filling the gaps (`fillna()`), either with a fixed value or with a value imputed from the data, such as the column mean or median.
  
  Example:
  ```python
  df.fillna(0, inplace=True)  # Replace missing values with 0
  ```

- **Remove Duplicates**  
  Ensure the dataset does not contain duplicate rows by using `drop_duplicates()`.
  
  Example:
  ```python
  df.drop_duplicates(inplace=True)
  ```

- **Fix Data Types**  
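  Columns sometimes load with the wrong data type, for example numbers read as strings. Convert them with `astype()`. Converting to `int` raises an error if the column still contains missing values, so handle those first.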
  
  Example:
  ```python
  df['column_name'] = df['column_name'].astype('int')
  ```

- **Rename Columns (if needed)**  
  Make the column names more descriptive or consistent if needed.
  
  Example:
  ```python
  df.rename(columns={'old_name': 'new_name'}, inplace=True)
  ```
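
The cleaning steps above can be combined into one short pass. The following is a minimal sketch on a made-up DataFrame (the columns `id`, `age`, and `city` are hypothetical), not a general-purpose recipe. It counts missing values first, drops rows that lack a key field, imputes a numeric column with its median instead of 0, and uses `pd.to_numeric(..., errors="coerce")` so unparseable entries become missing values rather than raising an error.

```python
import pandas as pd

# Hypothetical raw data: a duplicate row, a non-numeric age, a missing age, a missing city.
raw = pd.DataFrame({
    "id":   [1, 2, 2, 3, 4],
    "age":  ["25", "31", "31", "unknown", None],
    "city": ["Oslo", "Lima", "Lima", None, "Pune"],
})

# Coerce age to numbers; values that cannot be parsed become NaN.
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")

print(raw.isna().sum())  # missing values per column before cleaning

clean = (
    raw.drop_duplicates()                  # remove exact duplicate rows
       .dropna(subset=["city"])            # drop rows with no city
       .rename(columns={"id": "person_id"})
       .copy()
)

# Fill remaining missing ages with the median, then convert to integers.
clean["age"] = clean["age"].fillna(clean["age"].median()).astype(int)

print(clean)
```

Whether to drop or impute depends on the column: here a row without a city is treated as unusable, while a missing age is treated as recoverable. Both choices are assumptions made for the example.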

---

#### **Step 3: Simple Data Analysis**

- **Summary Statistics**  
  Generate summary statistics like the mean, median, and standard deviation of numeric columns using `mean()`, `median()`, and `std()`.
  
  Example:
  ```python
  mean_value = df['column_name'].mean()
  median_value = df['column_name'].median()
  std_value = df['column_name'].std()
  ```

- **Filter Data**  
  Extract a subset of the data based on conditions. For instance, selecting rows where a column value exceeds a threshold.
  
  Example:
  ```python
  filtered_data = df[df['column_name'] > 100]
  ```

- **Group and Aggregate**  
  Use `groupby()` and aggregation functions to analyze data. For instance, grouping data by a categorical column and calculating the mean for each group.
  
  Example:
  ```python
  grouped_data = df.groupby('category_column').mean(numeric_only=True)  # Skip non-numeric columns
  ```

- **Data Visualization (Optional)**  
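  Use basic visualizations such as histograms or bar plots to understand the distribution of the data. Pandas plotting requires Matplotlib to be installed; when running as a script rather than in a notebook, call `matplotlib.pyplot.show()` to display the figure.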
  
  Example:
  ```python
  df['column_name'].hist()
  ```
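
The analysis steps above can be sketched together on a small made-up table (the `team`, `player`, and `points` columns are hypothetical). The example combines two filter conditions with `&` and uses `agg()` to compute several statistics per group in one call. Selecting the numeric column before aggregating avoids errors from text columns.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "team":   ["A", "A", "B", "B", "B"],
    "player": ["p1", "p2", "p3", "p4", "p5"],
    "points": [10, 20, 30, 40, 50],
})

# Summary statistics for one numeric column.
print(df["points"].mean(), df["points"].median(), df["points"].std())

# Filter with two conditions; each condition needs its own parentheses.
team_b_high = df[(df["team"] == "B") & (df["points"] > 30)]
print(team_b_high)

# Group by a categorical column and compute several aggregates at once.
per_team = df.groupby("team")["points"].agg(["mean", "max", "count"])
print(per_team)
```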

---

### Example Project

Here’s a full example using a sample dataset called `students.csv`:

```python
import pandas as pd

# Step 1: Load the CSV file
df = pd.read_csv('students.csv')

# Step 2: Inspect the Data
print(df.head())
df.info()  # info() prints directly and returns None
print(df.describe())

# Step 3: Handle Missing Data
df.fillna(df.mean(numeric_only=True), inplace=True)  # Fill missing numeric data with each column's mean

# Step 4: Remove Duplicates
df.drop_duplicates(inplace=True)

# Step 5: Fix Data Types
df['age'] = df['age'].astype(int)

# Step 6: Rename Columns
df.rename(columns={'math_score': 'Math Score', 'reading_score': 'Reading Score'}, inplace=True)

# Step 7: Simple Analysis
# 7.1 Summary Statistics
print(df['Math Score'].mean())
print(df['Reading Score'].median())

# 7.2 Filter Data
high_math = df[df['Math Score'] > 75]  # 75 is an arbitrary example threshold
print(high_math)

# 7.3 Grouping and Aggregation
grouped_by_gender = df.groupby('gender').mean(numeric_only=True)  # Average only the numeric columns
print(grouped_by_gender)

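# (Optional) Visualization; requires Matplotlib
import matplotlib.pyplot as plt

df['Math Score'].hist()
plt.show()  # Needed to display the plot when running as a script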
```

---

### Deliverables
1. **Cleaned CSV File**: After processing, save the cleaned data to a new CSV file.
   Example:
   ```python
   df.to_csv('cleaned_data.csv', index=False)
   ```

2. **Summary Report**: Provide a summary of the key insights from the data, including any interesting patterns or trends identified during analysis.
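
One possible way to start the summary report is to collect a few headline numbers programmatically. This is only a sketch: the DataFrame contents and the `build_report` helper are hypothetical, and a real report would add interpretation that code cannot supply.

```python
import pandas as pd

def build_report(df: pd.DataFrame, column: str) -> str:
    """Return a short plain-text summary of one numeric column."""
    lines = [
        f"Rows: {len(df)}",
        f"Missing values in {column}: {df[column].isna().sum()}",
        f"Mean {column}: {df[column].mean():.2f}",
        f"Median {column}: {df[column].median():.2f}",
    ]
    return "\n".join(lines)

# Hypothetical cleaned data for illustration only.
cleaned = pd.DataFrame({"score": [70.0, 80.0, 90.0]})
report = build_report(cleaned, "score")
print(report)
```

The returned string can be written to a text file with Python's built-in `open()` and submitted alongside the cleaned CSV.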

---

### Questions

1. What method is used to load data from a CSV file into a DataFrame in Pandas?
2. How can you handle missing data in a dataset using Pandas?
3. What is the difference between `loc[]` and `iloc[]` when accessing data in a DataFrame?
4. How can you filter data based on specific conditions in Pandas?
5. Explain how `groupby()` works in Pandas and give an example.

---

### Summary

In this mini-project, you learned how to load data from a CSV file using Pandas, clean it by handling missing values and duplicates, and perform basic data analysis such as filtering, grouping, and generating summary statistics. This workflow underlies most data-driven projects, because it leaves the dataset clean and ready for more advanced analysis.

---