
# Load Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

## Models
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression

## Metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, roc_auc_score, RocCurveDisplay

## Set global scikit-learn configuration 
from sklearn import set_config
## Display estimators as a diagram
set_config(display='diagram')  # 'text' or 'diagram'

# **Dataset 1 - Stroke Prediction**

1. **Source of data**
  - https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
2. **Brief description of data**
  - According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
  - This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.
3. **What is the target?**
  - Column 'stroke': yes (1) or no (0)
4. **What does one row represent? (A person?  A business?  An event? A product?)**
  - A row represents a person
5. **Is this a classification or regression problem?**
  - Classification, predicting stroke: yes or no
6. **How many features does the data have?**
  - 11 features (including the 'id' identifier column) and 1 target variable, 12 columns in total
7. **How many rows are in the dataset?**
  - 5110 rows
8. **What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?**
  - Missing data (e.g. in 'bmi'), removing unnecessary rows, finding correlations, and handling a mix of categorical and numerical data

In [5]:
stroke_df = pd.read_csv('/content/drive/MyDrive/Coding Dojo/Stack 2 Intro to Machine Learning/Week 7/dataset/healthcare-dataset-stroke-data.csv')
stroke_df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [7]:
stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
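The `info()` output above shows 201 missing values in `bmi` (4909 of 5110 non-null) alongside a mix of numeric and object columns. The following is a minimal sketch of one way to handle both, not the notebook's actual modeling code. It re-types the five rows shown by `stroke_df.head()` so it runs without the CSV. The name `sample_df`, the use of `SimpleImputer` with a median strategy, and the decision to drop `id` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The five rows from stroke_df.head(), re-typed so the sketch is self-contained
sample_df = pd.DataFrame({
    "id": [9046, 51676, 31112, 60182, 1665],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "hypertension": [0, 0, 0, 0, 1],
    "heart_disease": [1, 0, 1, 0, 0],
    "ever_married": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "work_type": ["Private", "Self-employed", "Private", "Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Rural"],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked",
                       "smokes", "never smoked"],
    "stroke": [1, 1, 1, 1, 1],
})

# Count missing values per column (only bmi has any in this sample)
missing_counts = sample_df.isna().sum()

# Assumption: id is an identifier with no predictive meaning, so it is dropped
X = sample_df.drop(columns=["id", "stroke"])
y = sample_df["stroke"]

# Numeric columns: impute with the median (an assumed strategy), then scale
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Non-numeric columns: one-hot encode, ignoring categories unseen at fit time
cat_encoder = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include="number")),
    (cat_encoder, make_column_selector(dtype_exclude="number")),
)

X_processed = preprocessor.fit_transform(X)
print(missing_counts["bmi"], X_processed.shape)
```

On the full dataset the preprocessor would normally be fit on the training split only (after `train_test_split`) and then applied to the test split, so that imputation and scaling statistics do not leak from test data.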


# **Dataset 2 - Kickstarter Projects**
1. **Source of data**
  - https://www.kaggle.com/datasets/ulrikthygepedersen/kickstarter-projects?resource=download
2. **Brief description of data**
  - Researchers and analysts can gain insights into the characteristics of successful and unsuccessful Kickstarter projects, such as funding targets, project categories, and funding sources. This information can be used to inform investment decisions and guide future crowdfunding campaigns.
3. **What is the target?**
  - Column 'State': Canceled, Failed, or Successful
4. **What does one row represent? (A person?  A business?  An event? A product?)**
  - A row represents a single Kickstarter project (a business venture seeking funding)
5. **Is this a classification or regression problem?**
  - Classification, predicting a project's funding outcome: Canceled, Failed, or Successful
6. **How many features does the data have?**
  - 10 features and 1 target variable
7. **How many rows are in the dataset?**
  - 374853 rows
8. **What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?**
  - Missing data, irrelevant columns, a mix of categorical and numerical data, and determining which metrics to use

In [2]:
ks_df = pd.read_csv('/content/drive/MyDrive/Coding Dojo/Stack 2 Intro to Machine Learning/Week 7/dataset/Kickstarter.csv')
ks_df.head()

| | ID | Name | Category | Subcategory | Country | Launched | Deadline | Goal | Pledged | Backers | State |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1860890148 | Grace Jones Does Not Give A F$#% T-Shirt (limi... | Fashion | Fashion | United States | 4/21/2009 21:02 | 5/31/2009 | 1000 | 625 | 30 | Failed |
| 1 | 709707365 | CRYSTAL ANTLERS UNTITLED MOVIE | Film & Video | Shorts | United States | 4/23/2009 0:07 | 7/20/2009 | 80000 | 22 | 3 | Failed |
| 2 | 1703704063 | drawing for dollars | Art | Illustration | United States | 4/24/2009 21:52 | 5/3/2009 | 20 | 35 | 3 | Successful |
| 3 | 727286 | Offline Wikipedia iPhone app | Technology | Software | United States | 4/25/2009 17:36 | 7/14/2009 | 99 | 145 | 25 | Successful |
| 4 | 1622952265 | Pantshirts | Fashion | Fashion | United States | 4/27/2009 14:10 | 5/26/2009 | 1900 | 387 | 10 | Failed |


In [4]:
ks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374853 entries, 0 to 374852
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   ID           374853 non-null  int64 
 1   Name         374853 non-null  object
 2   Category     374853 non-null  object
 3   Subcategory  374853 non-null  object
 4   Country      374853 non-null  object
 5   Launched     374853 non-null  object
 6   Deadline     374853 non-null  object
 7   Goal         374853 non-null  int64 
 8   Pledged      374853 non-null  int64 
 9   Backers      374853 non-null  int64 
 10  State        374853 non-null  object
dtypes: int64(4), object(7)
memory usage: 31.5+ MB
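The `info()` output above lists `Launched` and `Deadline` as `object` columns, meaning they are stored as text rather than dates. The sketch below shows one possible way to parse them and derive a numeric feature. It uses only the five rows shown by `ks_df.head()`. The names `sample_ks` and `campaign_days` are hypothetical, and the format strings are an assumption inferred from those five rows, so they may not hold for every row of the full file.

```python
import pandas as pd

# A subset of columns from the five rows of ks_df.head(), re-typed for a self-contained sketch
sample_ks = pd.DataFrame({
    "ID": [1860890148, 709707365, 1703704063, 727286, 1622952265],
    "Category": ["Fashion", "Film & Video", "Art", "Technology", "Fashion"],
    "Launched": ["4/21/2009 21:02", "4/23/2009 0:07", "4/24/2009 21:52",
                 "4/25/2009 17:36", "4/27/2009 14:10"],
    "Deadline": ["5/31/2009", "7/20/2009", "5/3/2009", "7/14/2009", "5/26/2009"],
    "Goal": [1000, 80000, 20, 99, 1900],
    "State": ["Failed", "Failed", "Successful", "Successful", "Failed"],
})

# Assumed formats: month/day/year, with a 24-hour time on Launched only
sample_ks["Launched"] = pd.to_datetime(sample_ks["Launched"], format="%m/%d/%Y %H:%M")
sample_ks["Deadline"] = pd.to_datetime(sample_ks["Deadline"], format="%m/%d/%Y")

# Whole days between the launch date (time of day dropped) and the deadline
launch_date = sample_ks["Launched"].dt.normalize()
sample_ks["campaign_days"] = (sample_ks["Deadline"] - launch_date).dt.days

print(sample_ks[["Launched", "Deadline", "campaign_days"]])
```

A numeric duration like this can be scaled alongside `Goal`, whereas the raw date strings would otherwise be treated as high-cardinality categories.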
