# **Data Collection Notebook**

## Objectives

- Download data from Kaggle and save it as raw data.
- Evaluate missing data.
- Inspect the data and save it under outputs/datasets/collection.

## Inputs

Dataset from Kaggle: data.csv

## Outputs

Generate Dataset: outputs/datasets/collection/---

## Additional Comments


# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
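
To make these two calls concrete, here is a minimal sketch that runs in a throwaway folder instead of the real project. The folder name `jupyter_notebooks` is a hypothetical stand-in for wherever the notebooks live, and the original working directory is restored at the end.

```python
import os
import tempfile

original_dir = os.getcwd()

# Build a temporary "project" with a notebooks subfolder (hypothetical layout).
with tempfile.TemporaryDirectory() as project_dir:
    notebooks_dir = os.path.join(project_dir, "jupyter_notebooks")
    os.makedirs(notebooks_dir)
    try:
        # Start inside the subfolder, as a notebook would.
        os.chdir(notebooks_dir)
        # Move up one level: dirname() gives the parent, chdir() switches to it.
        os.chdir(os.path.dirname(os.getcwd()))
        moved_to_parent = os.path.samefile(os.getcwd(), project_dir)
    finally:
        # Always restore the directory we started from.
        os.chdir(original_dir)

print(moved_to_parent)
```

Note that running the real cell twice moves up two levels, so it should be executed only once per session.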

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Download data from Kaggle and upload

In [None]:
! pip install -r requirements.txt

In [None]:
# importing libraries
import numpy
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

We are using the following dataset from Kaggle: [Kaggle URL](https://www.kaggle.com/datasets/vijayaadithyanvg/breast-cancer-prediction)

In [None]:
# reading data from the file
df = pd.read_csv("inputs/datasets/raw/data.csv")

In [None]:
df.head()

---

# Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/data.csv")
df.head()

# DataFrame Summary

In [None]:
df.info()

# Data Cleaning

Missing or Null Data points

We will look for missing or null data points in the dataset (if there are any) using pandas. First, we reload the raw data.

In [None]:
import pandas as pd
df_raw_path = "inputs/datasets/raw/data.csv"
df = pd.read_csv(df_raw_path)
df.head(5)

List the variables that contain missing or null data

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")
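
As a rough illustration of the missing-data check, the sketch below counts nulls per column on a small made-up DataFrame. The column names and values are invented for the example and are not taken from the Kaggle dataset, and `missing_summary` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns where at least one value is missing.
    return summary[summary["missing_count"] > 0]


# Toy data: only "radius" contains nulls (illustrative values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "radius": [10.5, np.nan, 12.1, np.nan],
    "texture": [20.0, 21.5, 19.8, 22.3],
})

summary = missing_summary(toy)
print(summary)
```

An empty result means no column has missing data, which corresponds to the `else` branch in the cell above.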

Drop Duplicates

In [None]:
nr_rows = df.shape[0]
print('Number of rows: %d' % nr_rows)
df = df.drop_duplicates().reset_index(drop=True)
print('Number of dropped rows: %d' % (nr_rows - df.shape[0]))
print('Number of remaining rows: %d' % df.shape[0])
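
For reference, this is how the same duplicate-removal pattern behaves on a tiny made-up frame. The data is invented purely for illustration.

```python
import pandas as pd

# Toy data: the second and third rows are exact duplicates.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": [0.1, 0.2, 0.2, 0.3],
})

n_before = len(toy)
# drop_duplicates() keeps the first occurrence; reset_index() renumbers the rows.
deduped = toy.drop_duplicates().reset_index(drop=True)
n_dropped = n_before - len(deduped)

print(f"Dropped rows: {n_dropped}")
print(deduped)
```

By default `drop_duplicates()` only removes rows that match on every column, so two records that differ in a single field are both kept.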

In [None]:
# print the first 5 rows of the dataframe
df.head(5)

In [None]:
# number of rows and columns in the dataset
df.shape

In [None]:
# getting some information about the data
df.info()

In [None]:
df.describe(include='all').transpose()

---

# Split Data into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
# Split the whole DataFrame 80/20; random_state makes the split reproducible
TrainSet, TestSet = train_test_split(
                                        df,
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
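
As a quick illustration of how `test_size` and `random_state` behave, here is a sketch on a made-up 100-row frame (the columns are invented for the example). It assumes scikit-learn is installed, as in the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows (illustrative data only).
toy = pd.DataFrame({"id": range(100), "feature": range(100, 200)})

# Two splits with the same random_state select the same rows.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=0)

same_split = train_a.index.equals(train_b.index)
no_overlap = set(train_a["id"]).isdisjoint(test_a["id"])

print(train_a.shape, test_a.shape)
print(same_split, no_overlap)
```

Fixing `random_state` keeps the train and test rows stable across reruns, which matters when later notebooks rely on the same split.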

In [None]:
variables_method = ['id', 'diagnosis' ]

print(f"* {len(variables_method)} variables to drop \n\n"
    f"{variables_method}")

NOTE

In [None]:
from feature_engine.selection import DropFeatures
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(TrainSet)
df_method = imputer.transform(TrainSet)

In [None]:
from feature_engine.selection import DropFeatures
imputer = DropFeatures(features_to_drop=variables_method)
imputer.fit(TrainSet)

TrainSet, TestSet = imputer.transform(TrainSet), imputer.transform(TestSet)

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down (you can't create a workflow where, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content).

---

# Push files to Repo

* If you don't need to push files to the repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

# save the dataset whether or not the folder already existed
df.to_csv("outputs/datasets/collection/cancer.csv", index=False)
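
When saving outputs, the target folder has to exist before `to_csv` is called, and `os.makedirs` should receive the folder path rather than the file path. The sketch below shows one possible pattern using `exist_ok=True` in a temporary directory; the file name `toy.csv` and the data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the real dataset (illustrative values).
toy = pd.DataFrame({"id": [1, 2], "diagnosis": ["M", "B"]})

with tempfile.TemporaryDirectory() as base_dir:
    out_dir = os.path.join(base_dir, "outputs", "datasets", "collection")

    # exist_ok=True makes the call safe to repeat when the notebook is rerun.
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "toy.csv")  # hypothetical file name
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm it was written as a regular file.
    is_file = os.path.isfile(out_path)
    reloaded = pd.read_csv(out_path)

print(is_file)
print(reloaded)
```

Reading the file back is a cheap check that the write succeeded and that `index=False` kept the index out of the saved columns.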
