# Assignment 1: Data Cleaning 

# Introduction
In this notebook, you'll download a dataset from Kaggle using the Kaggle API, load it into a pandas DataFrame, and apply common data cleaning techniques.
Follow the steps and adapt them to suit your dataset.

## Step 1: Kaggle API Setup
To begin, set up the Kaggle API so you can download datasets.

### 1.1 Install the Kaggle API
If you haven't installed the Kaggle API yet, run the command below in a notebook cell. In a terminal such as Git Bash, drop the leading `!`.


    !pip install kaggle


### 1.2 Authenticate the Kaggle API
To use the Kaggle API, you'll need to authenticate. Follow these steps:

- Go to https://www.kaggle.com/account and, under **API**, click "Create New API Token."
- This downloads a `kaggle.json` file containing your credentials.
- Save the `kaggle.json` file somewhere on your system that you can easily access.
- The cell below shows where my `kaggle.json` file is located; change the path to match yours.

In [1]:
# Once the file is uploaded, run the following code to move it to the correct directory:
import os
import json

# Change the path below if your file is stored in another location.
kaggle_json_path = 'kaggle.json'

if os.path.exists(kaggle_json_path):
    with open(kaggle_json_path, 'r') as f:
        kaggle_api = json.load(f)
    os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
    with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as f:
        json.dump(kaggle_api, f)
    os.chmod(os.path.expanduser('~/.kaggle/kaggle.json'), 0o600)
    print("Kaggle API credentials successfully set up!")
else:
    print("Please upload your 'kaggle.json' file first.")



Please upload your 'kaggle.json' file first.
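Before moving on, it can help to confirm that the file looks like a valid token. The sketch below assumes the token is a JSON object with `username` and `key` fields, which is the usual layout of `kaggle.json`. The helper name `check_kaggle_credentials` is made up for illustration and is not part of the Kaggle API.

```python
import json
import os


def check_kaggle_credentials(path):
    """Return a list of problems found in the credentials file (empty if none)."""
    if not os.path.exists(path):
        return [f"file not found: {path}"]
    try:
        with open(path, "r") as f:
            creds = json.load(f)
    except json.JSONDecodeError:
        return ["file is not valid JSON"]
    if not isinstance(creds, dict):
        return ["file does not contain a JSON object"]

    problems = []
    # Assumption: a Kaggle token stores its values under 'username' and 'key'
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")
    return problems


# Check the location the setup cell writes to
issues = check_kaggle_credentials(os.path.expanduser("~/.kaggle/kaggle.json"))
print(issues if issues else "Credentials file looks complete.")
```

An empty list means the file exists and has both fields filled in. It does not prove that Kaggle will accept the token.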



### 1.3 Downloading the Dataset
Now that you're authenticated, you can download your dataset from Kaggle. Replace `your-dataset-here` below with the dataset's identifier in `owner/dataset-name` form.

Example Kaggle dataset: `zillow/zecon` (Real Estate Dataset).
You can search for datasets at https://www.kaggle.com/datasets



In [None]:
dataset = 'your-dataset-here'  # Update this with your chosen dataset from Kaggle
output_dir = './data'  # Directory to store your downloaded dataset

### Download the dataset using Kaggle API (run the next cell)



In [2]:
# Download the dataset using Kaggle API
!kaggle datasets download -d {dataset} -p {output_dir} --unzip


'kaggle' is not recognized as an internal or external command,
operable program or batch file.
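The output above shows that the shell could not find the `kaggle` command. This usually means the package is not installed in the active environment, or its scripts folder is not on `PATH`. Here is a minimal sketch for checking this from Python before running the download. `build_download_command` is a hypothetical helper, and the flags are the same ones used in the cell above.

```python
import shutil


def build_download_command(dataset, output_dir):
    """Build the same CLI call as the download cell, as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", output_dir, "--unzip"]


# shutil.which returns the full path of an executable, or None if it is not on PATH
kaggle_path = shutil.which("kaggle")
if kaggle_path is None:
    print("The 'kaggle' command is not on PATH. Check that the package is "
          "installed in the environment this notebook is using.")
else:
    print(f"Found kaggle at {kaggle_path}")
    print("Command to run:", " ".join(build_download_command("zillow/zecon", "./data")))
```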


# Data Cleaning
Now that you have your data, let's install and import the Python libraries we need.

In [3]:
%pip install -r requirements.txt
import pandas as pd


Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement python3 (from versions: none)
ERROR: No matching distribution found for python3



Import your dataset into a pandas DataFrame.

In [None]:
dataset_path = os.path.join(output_dir, "your_dataset.csv")  # Replace with your dataset file
df = pd.read_csv(dataset_path)

# Display the first few rows of the dataset
df.head()
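If you are not sure what the downloaded file is called, you can list the CSV files in the output directory first. This is a small sketch using only the standard library. `list_csv_files` is a hypothetical helper, and `./data` is assumed to be the `output_dir` set earlier.

```python
import glob
import os


def list_csv_files(directory):
    """Return the sorted paths of all .csv files directly inside `directory`."""
    return sorted(glob.glob(os.path.join(directory, "*.csv")))


# Print every CSV found; nothing is printed if the directory is empty or missing
for path in list_csv_files("./data"):
    print(path)
```

Pick one of the printed file names and use it in place of `your_dataset.csv`.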

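Here are some common cleaning tasks and the pandas functions that help with them:
- Handling missing values: `isnull()`, `dropna()`, `fillna()`
- Removing duplicates: `drop_duplicates()`
- Data type conversion: `astype()`, `pd.to_datetime()`
- Handling outliers: filtering with `quantile()` and the IQR rule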

In [None]:
# Check for missing values in each column
print(df.isnull().sum())

# Option 1: drop rows that contain missing data
df_cleaned = df.dropna()

# Option 2: fill missing numeric values with the column mean instead.
# Uncomment to use; it replaces the result of Option 1.
# df_cleaned = df.fillna(df.mean(numeric_only=True))

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Data type conversions, if needed
# Convert a string column to datetime format (replace 'date_column' with your column)
df_cleaned['date_column'] = pd.to_datetime(df_cleaned['date_column'])
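To see what these steps do, here is a sketch on a tiny made-up table. The column names and values are invented for illustration, and it fills the missing value rather than dropping the row.

```python
import pandas as pd

# Toy data (made up): one missing price, one duplicate row,
# and dates stored as strings
toy = pd.DataFrame({
    "date_column": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
    "price": [100.0, 200.0, 200.0, None],
})

# Fill the missing price with the mean of the non-missing prices
toy_cleaned = toy.fillna({"price": toy["price"].mean()})

# Drop the repeated row
toy_cleaned = toy_cleaned.drop_duplicates()

# Convert the date strings to datetime values
toy_cleaned = toy_cleaned.assign(
    date_column=pd.to_datetime(toy_cleaned["date_column"])
)

print(toy_cleaned)
print(toy_cleaned.dtypes)
```

The result has three rows, no missing prices, and a datetime column.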

In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile) for the numeric column
# (replace 'numeric_column' with a column from your dataset).
# This works on df_cleaned so the earlier cleaning steps are kept.
Q1 = df_cleaned['numeric_column'].quantile(0.25)
Q3 = df_cleaned['numeric_column'].quantile(0.75)

# Calculate IQR (Interquartile Range)
IQR = Q3 - Q1

# Define the acceptable range for non-outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the rows whose value lies inside the acceptable range (inclusive)
in_range = df_cleaned['numeric_column'].between(lower_bound, upper_bound)
df_no_outliers = df_cleaned[in_range]

# View the dataset with outliers removed
df_no_outliers.head()
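A worked example may make the IQR rule easier to follow. The numbers below are made up, and the value 100 is the obvious outlier.

```python
import pandas as pd

# Made-up values, already sorted so the quartiles are easy to check by hand
values = pd.Series([10, 11, 11, 12, 12, 13, 100])

q1 = values.quantile(0.25)     # 11.0
q3 = values.quantile(0.75)     # 12.5
iqr = q3 - q1                  # 1.5

lower_bound = q1 - 1.5 * iqr   # 8.75
upper_bound = q3 + 1.5 * iqr   # 14.75

# Keep values inside [lower_bound, upper_bound]
kept = values[values.between(lower_bound, upper_bound)]
print(kept.tolist())           # [10, 11, 11, 12, 12, 13]
```

The multiplier 1.5 is a common convention, not a fixed rule. A larger multiplier removes fewer points.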




# Data Transformation 
Here we will add new columns and remove unnecessary ones from our DataFrame.
I've included additional Python code you can apply to your DataFrame:
- Creating new columns
- Grouping and aggregating statistics
- Saving the dataset to a new CSV file


In [None]:

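# Example: creating a new column based on a condition
df_cleaned['new_column'] = df_cleaned['existing_column'].apply(
    lambda x: 'Category A' if x > 100 else 'Category B'
)

# Example: removing a column you don't need
# ('unneeded_column' is a placeholder; replace it with a real column name)
df_cleaned = df_cleaned.drop(columns=['unneeded_column'])

# Group the data and calculate the mean of a numeric column per group
df_grouped = df_cleaned.groupby('group_column')['numeric_column'].mean()

# Save the dataset to CSV, creating the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
cleaned_data_path = os.path.join(output_dir, "cleaned_dataset.csv")
df_cleaned.to_csv(cleaned_data_path, index=False)

print(f"Cleaned dataset saved to {cleaned_data_path}")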
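Here is the same idea on a small made-up table, so you can see the results before trying it on your own data. The column names and values are invented for illustration. It uses `agg` to compute more than one statistic per group in a single call.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "group_column": ["north", "north", "south"],
    "numeric_column": [50, 150, 200],
})

# Label each row using the same cutoff of 100 as the cell above
toy["new_column"] = toy["numeric_column"].apply(
    lambda x: "Category A" if x > 100 else "Category B"
)

# Mean and row count for each group
summary = toy.groupby("group_column")["numeric_column"].agg(["mean", "count"])
print(toy)
print(summary)
```

In this table, `north` has a mean of 100.0 over 2 rows and `south` has a mean of 200.0 over 1 row.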