<a href="https://colab.research.google.com/github/A-Kutscher/Project-2/blob/main/Project_2_Part_1_(Core)_AK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 2 - Part 1 (Core)**
Author: Amber Kutscher

## Assignment

You will have a lot more freedom in your second project than in your first project. This is because we want you to have a project in your portfolio that interests you or relates to the industry you would like to work in.

Your task for this week is to propose two possible datasets you would like to work with for Project 2 and indicate which is your first and which is your second choice.

**You can choose datasets from either:**

***1. This list of pre-approved datasets (complete the tasks listed below for both of them)***

**Pre-approved Datasets:**
- [Adult income dataset](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset)
- [Car Insurance Data](https://www.kaggle.com/datasets/sagnik1511/car-insurance-data)
- [Metabolic Syndrome Prediction - dataset by informatics-edu](https://data.world/informatics-edu/metabolic-syndrome-prediction)
- [Pump It Up Challenge: Driven Data](https://www.kaggle.com/datasets/sumeetsawant/pump-it-up-challenge-driven-data?select=training_Set_values.csv)
- [Cirrhosis Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset)
- [Spanish Wine Quality Dataset](https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset)


***2. Other dataset(s) that align with an interest or field of expertise you already have.***

**If you choose to source your own data, please be sure it meets the below requirements:**
  - has a clear target or labels to predict
  - has potential for interesting analysis
  - the data must be available for use (it is your responsibility to ensure that the license states that you are able to use it.)
  - the data must be appropriate for a professional environment
  - the data must not contain personal information
  - the data must not be a dataset used for any assignment, lecture, or task from the course
  - the data must not be a time series dataset. You will be able to identify these because each row will represent a moment or interval of time. These kinds of datasets follow special rules and are not appropriate for the kind of machine learning you have learned in this stack.
  - is reasonable to work with during the time you have

**A great dataset would also have:**
  - 8-25 columns (features)
  - 1000 - 100,000 rows (instances)
  - both numeric and categorical features
  - At least some missing values in both categorical and numeric columns, so you can show off your data-cleaning skills!
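
The guidelines above can be checked programmatically once a candidate dataset is loaded. The following is a minimal sketch, not part of the assignment: the helper name `check_dataset_guidelines` is hypothetical, it assumes the 8-25 column range counts features excluding the target, and the small frame is made-up data used only to show the call.

```python
import pandas as pd


def check_dataset_guidelines(df: pd.DataFrame, target: str) -> dict:
    """Compare a DataFrame against the 'great dataset' guidelines (hypothetical helper)."""
    # Assumption: the column guideline counts features, so the target is excluded
    features = df.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns
    categorical_cols = features.select_dtypes(exclude="number").columns
    return {
        "n_features_8_to_25": 8 <= features.shape[1] <= 25,
        "n_rows_1000_to_100000": 1000 <= len(df) <= 100_000,
        "has_numeric_and_categorical": len(numeric_cols) > 0 and len(categorical_cols) > 0,
        "missing_in_numeric": bool(features[numeric_cols].isna().any().any()),
        "missing_in_categorical": bool(features[categorical_cols].isna().any().any()),
    }


# Tiny made-up frame, only to demonstrate the call
toy = pd.DataFrame({
    "size": [1.0, None, 3.0],
    "color": ["red", None, "blue"],
    "label": [0, 1, 0],
})
report = check_dataset_guidelines(toy, target="label")
print(report)
```

On a real candidate, the same call would take the loaded DataFrame and the name of its target column.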

You may choose a regression or classification problem for each proposed data set, but you might consider a classification project since you already have a regression project in your portfolio. This way you can show your skills with both!

If you choose 2 datasets from outside of the pre-approved list, at least one of the datasets you choose must be approved by the instructor before you can use it for your project.

## Instructions

Whether you chose two pre-approved datasets, two datasets from another source, or one of each:

Create a notebook where you have uploaded and shown the .head() of each of your data sets. For each of the proposed datasets, include the following information in text cells:

**First choice: dataset 1**
1. Source of data

2. Brief description of data

3. What is the target?

4. What does one row represent? (A person? A business? An event? A product?)

5. Is this a classification or regression problem?

6. How many features does the data have?

7. How many rows are in the dataset?

8. What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?

**Second choice: dataset 2**
1. Source of data

2. Brief description of data

3. What is the target?

4. What does one row represent? (A person? A business? An event? A product?)

5. Is this a classification or regression problem?

6. How many features does the data have?

7. How many rows are in the dataset?

8. What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?

________________________________________________________________________

## Imports and Mount Drive

In [1]:
# Imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score, ConfusionMatrixDisplay,
                             RocCurveDisplay)
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn import set_config
set_config(transform_output='pandas')

In [2]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


_______________________________________________________________

## **Dataset 1**

### Data Dictionary

- ID: unique identifier

- N_Days: number of days between registration and the earliest of death, transplantation, or study analysis time in July 1986

- Status: status of the patient, one of C (censored), CL (censored due to liver transplant), or D (death)

- Drug: type of drug, D-penicillamine or placebo

- Age: age in [days]

- Sex: M (male) or F (female)

- Ascites: presence of ascites, N (No) or Y (Yes)

- Hepatomegaly: presence of hepatomegaly, N (No) or Y (Yes)

- Spiders: presence of spiders, N (No) or Y (Yes)

- Edema: presence of edema, one of N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)

- Bilirubin: serum bilirubin in [mg/dl]

- Cholesterol: serum cholesterol in [mg/dl]

- Albumin: albumin in [gm/dl]

- Copper: urine copper in [ug/day]

- Alk_Phos: alkaline phosphatase in [U/liter]

- SGOT: SGOT in [U/ml]

- Tryglicerides: triglycerides in [mg/dl] (the column name is misspelled in the source file)

- Platelets: platelets per cubic [ml/1000]

- Prothrombin: prothrombin time in seconds [s]

- Stage: histologic stage of disease (1, 2, 3, or 4)
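
Because `Age` is recorded in days, it can be easier to interpret after conversion to years. This is a minimal sketch, assuming an average year length of 365.25 days; the `Age_Years` column name is hypothetical, and the three values are copied from this notebook's `df_ds1.head()` output.

```python
import pandas as pd

DAYS_PER_YEAR = 365.25  # assumption: average year length, including leap years

# Age values (in days) taken from the first three rows of df_ds1.head()
ages = pd.DataFrame({"Age": [21464, 20617, 25594]})

# Hypothetical new column holding the age in years, rounded to one decimal
ages["Age_Years"] = (ages["Age"] / DAYS_PER_YEAR).round(1)
print(ages)
```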

### Load Dataset

In [3]:
# Load in data
fpath_ds1 = "/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week07/Data/cirrhosis[1].csv"
df_ds1 = pd.read_csv(fpath_ds1)
df_ds1.head()

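| | ID | N_Days | Status | Drug | Age | Sex | Ascites | Hepatomegaly | Spiders | Edema | Bilirubin | Cholesterol | Albumin | Copper | Alk_Phos | SGOT | Tryglicerides | Platelets | Prothrombin | Stage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 400 | D | D-penicillamine | 21464 | F | Y | Y | Y | Y | 14.5 | 261.0 | 2.6 | 156.0 | 1718.0 | 137.95 | 172.0 | 190.0 | 12.2 | 4.0 |
| 1 | 2 | 4500 | C | D-penicillamine | 20617 | F | N | Y | Y | N | 1.1 | 302.0 | 4.14 | 54.0 | 7394.8 | 113.52 | 88.0 | 221.0 | 10.6 | 3.0 |
| 2 | 3 | 1012 | D | D-penicillamine | 25594 | M | N | N | N | S | 1.4 | 176.0 | 3.48 | 210.0 | 516.0 | 96.1 | 55.0 | 151.0 | 12.0 | 4.0 |
| 3 | 4 | 1925 | D | D-penicillamine | 19994 | F | N | Y | Y | S | 1.8 | 244.0 | 2.54 | 64.0 | 6121.8 | 60.63 | 92.0 | 183.0 | 10.3 | 4.0 |
| 4 | 5 | 1504 | CL | Placebo | 13918 | F | N | Y | Y | N | 3.4 | 279.0 | 3.53 | 143.0 | 671.0 | 113.15 | 72.0 | 136.0 | 10.9 | 3.0 |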


In [4]:
df_ds1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   ID             418 non-null    int64  
 1   N_Days         418 non-null    int64  
 2   Status         418 non-null    object 
 3   Drug           312 non-null    object 
 4   Age            418 non-null    int64  
 5   Sex            418 non-null    object 
 6   Ascites        312 non-null    object 
 7   Hepatomegaly   312 non-null    object 
 8   Spiders        312 non-null    object 
 9   Edema          418 non-null    object 
 10  Bilirubin      418 non-null    float64
 11  Cholesterol    284 non-null    float64
 12  Albumin        418 non-null    float64
 13  Copper         310 non-null    float64
 14  Alk_Phos       312 non-null    float64
 15  SGOT           312 non-null    float64
 16  Tryglicerides  282 non-null    float64
 17  Platelets      407 non-null    float64
 18  Prothrombi

### Questions & Answers

1. Source of data
  - [Cirrhosis Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset)

2. Brief description of data
  - Cirrhosis is a late stage of liver scarring (fibrosis) brought on by a variety of liver disorders and conditions, including prolonged alcoholism and hepatitis. The data come from the 1974-1984 Mayo Clinic trial on primary biliary cirrhosis (PBC) of the liver. During that ten-year period, 424 PBC patients referred to Mayo Clinic qualified for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the dataset took part in the randomized trial and have mostly complete data. The other 112 patients declined to take part in the trial but agreed to have some basic measurements taken and to be monitored for survival. Six of those 112 were lost to follow-up shortly after diagnosis, so the dataset contains the 312 randomized patients plus 106 additional cases (418 rows in total).

3. What is the target?
  - Status

4. What does one row represent? (A person? A business? An event? A product?)
  - Each row represents a patient (person).

5. Is this a classification or regression problem?
  - This is a classification problem: the target, Status, is categorical with three classes (C, CL, and D).

6. How many features does the data have?
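  - This dataset has 20 columns. Excluding the target (Status) leaves 19, and also excluding the ID column leaves 18 candidate features.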

7. How many rows are in the dataset?
  - There are 418 rows.

8. What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?
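  - The `.info()` output shows missing values in several columns (for example, Drug, Ascites, Hepatomegaly, and Spiders each have 312 non-null values out of 418, Cholesterol has 284, and Tryglicerides has 282), so imputation will be needed. The dataset is also small, at 418 rows.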
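
The `df_ds1.info()` output reports fewer than 418 non-null values for several columns, so a per-column count of missing values is a useful first cleaning check. The following is a minimal sketch: the helper name `missing_summary` is hypothetical, and the small frame is illustrative data (with gaps inserted by hand), not the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Illustrative frame with hand-inserted gaps (not the real data)
toy = pd.DataFrame({
    "Drug": ["Placebo", None, "D-penicillamine", None],
    "Cholesterol": [261.0, np.nan, 176.0, 244.0],
    "Bilirubin": [14.5, 1.1, 1.4, 1.8],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook, the same helper could be called as `missing_summary(df_ds1)`.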

____________________________________________________________________________

## **Dataset 2**

### Load Dataset

In [11]:
# Load in data
fpath_ds2 = "/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week07/Data/pokemons[1].csv"
df_ds2 = pd.read_csv(fpath_ds2)
df_ds2.head()

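| | NationalNumber | Pokemon | Type_1 | Type_2 | Normal | Fire | Water | Electric | Grass | Ice | ... | Ground | Flying | Psychic | Bug | Rock | Ghost | Dragon | Dark | Steel | Fairy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | #001 | Bulbasaur | Grass | Poison | 1.0 | 2.0 | 0.5 | 0.5 | 0.25 | 2.0 | ... | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 |
| 1 | #002 | Ivysaur | Grass | Poison | 1.0 | 2.0 | 0.5 | 0.5 | 0.25 | 2.0 | ... | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 |
| 2 | #003 | Venusaur | Grass | Poison | 1.0 | 2.0 | 0.5 | 0.5 | 0.25 | 2.0 | ... | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 |
| 3 | #004 | Charmander | Fire | | 1.0 | 0.5 | 2.0 | 1.0 | 0.5 | 0.5 | ... | 2.0 | 1.0 | 1.0 | 0.5 | 2.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 |
| 4 | #005 | Charmeleon | Fire | | 1.0 | 0.5 | 2.0 | 1.0 | 0.5 | 0.5 | ... | 2.0 | 1.0 | 1.0 | 0.5 | 2.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 |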


In [12]:
df_ds2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   NationalNumber  893 non-null    object 
 1   Pokemon         893 non-null    object 
 2   Type_1          893 non-null    object 
 3   Type_2          443 non-null    object 
 4   Normal          893 non-null    float64
 5   Fire            893 non-null    float64
 6   Water           893 non-null    float64
 7   Electric        893 non-null    float64
 8   Grass           893 non-null    float64
 9   Ice             893 non-null    float64
 10  Fighting        893 non-null    float64
 11  Poison          893 non-null    float64
 12  Ground          893 non-null    float64
 13  Flying          893 non-null    float64
 14  Psychic         893 non-null    float64
 15  Bug             893 non-null    float64
 16  Rock            893 non-null    float64
 17  Ghost           893 non-null    flo

### Questions & Answers

1. Source of data
  - [Pokemon Gen 1-8 Dataset](https://www.kaggle.com/datasets/notlucasp/pokemon-gen-18-dataset)

2. Brief description of data
  - The 893 Pokemon (from generation 1 to generation 8) in this dataset are all listed with their National Number, Pokemon Name, Type 1 and Type 2, as well as the corresponding Defense Multipliers for each of the 18 Pokemon types (Normal, Fire, Water, Electric, Grass, Ice, Fighting, Poison, Ground, Flying, Psychic, Bug, Rock, Ghost, Dragon, Dark, Steel, and Fairy).

3. What is the target?
  - Type_1 (the Pokemon's primary type)

4. What does one row represent? (A person? A business? An event? A product?)
  - Each row represents a Pokemon.

5. Is this a classification or regression problem?
  - This is a classification problem.

6. How many features does the data have?
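  - This dataset has 22 columns. Excluding the target (Type_1) leaves 21, and also excluding the identifier columns (NationalNumber and Pokemon) leaves 19 candidate features.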

7. How many rows are in the dataset?
  - There are 893 rows.

8. What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?
  - I may not have enough data to properly predict the Pokemon type, as I am missing data on HP, Attack, Defense, Special Attack, Special Defense, and Speed.
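
To gauge how much signal the available columns carry, a first step is to separate the target from the features and look at the class counts. This is a minimal sketch, assuming `NationalNumber` and `Pokemon` are identifiers that should not be used as predictors; the three rows and three multiplier columns are copied from this notebook's `df_ds2.head()` output, and the real frame has more of both.

```python
import pandas as pd

# Three rows and a subset of columns taken from df_ds2.head()
toy = pd.DataFrame({
    "NationalNumber": ["#001", "#004", "#005"],
    "Pokemon": ["Bulbasaur", "Charmander", "Charmeleon"],
    "Type_1": ["Grass", "Fire", "Fire"],
    "Type_2": ["Poison", None, None],
    "Normal": [1.0, 1.0, 1.0],
    "Fire": [2.0, 0.5, 0.5],
    "Water": [0.5, 2.0, 2.0],
})

target = "Type_1"
id_cols = ["NationalNumber", "Pokemon"]  # assumption: identifiers, not predictors

# Features are everything except the target and the identifier columns
X = toy.drop(columns=[target] + id_cols)
y = toy[target]

# Class counts show how many examples each type has to learn from
class_counts = y.value_counts()
print(X.columns.tolist())
print(class_counts)
```

Applied to `df_ds2`, the class counts would show how the 893 rows are spread across the primary types, which indicates whether some classes have too few examples to model reliably.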

____________________________________________________________________________