# Acquire Data for Classification

# Big Ideas
- Cache your data to speed up your data acquisition.
- Helper functions are your friends.


# Objectives
By the end of the acquire lesson and exercises, you will be able to...
- read data into a pandas DataFrame using the following modules:

In [1]:
# # pydataset

# from pydataset import data
# df = data('dataset_name')

In [2]:
# # seaborn datasets

# import seaborn as sns
# df = sns.load_dataset('dataset_name')

In [3]:
import pandas as pd
import numpy as np
import os

# visualize
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('figure', figsize=(8, 6))
plt.rc('font', size=13)

# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

# acquire
from env import host, user, password

# To access pydataset data table use:
from pydataset import data

4. In a Jupyter notebook, `classification_exercises.ipynb`, use a Python module (pydataset or seaborn datasets) containing datasets as a source for the iris data. Create a pandas DataFrame, `df_iris`, from this data.
- print the first 3 rows
- print the number of rows and columns (shape)
- print the column names
- print the data type of each column
- print the summary statistics for each of the numeric variables

## 4. Create `df_iris`

- Use a Python module (pydataset or seaborn datasets) containing datasets as a source for the iris data.

In [4]:
data('iris', show_doc=True)

iris

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Edgar Anderson's Iris Data

### Description

This famous (Fisher's or Anderson's) iris data set gives the measurements in
centimeters of the variables sepal length and width and petal length and
width, respectively, for 50 flowers from each of 3 species of iris. The
species are _Iris setosa_, _versicolor_, and _virginica_.

### Usage

    iris
    iris3

### Format

`iris` is a data frame with 150 cases (rows) and 5 variables (columns) named
`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.

`iris3` gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with names `Sepal
L.`, `Sepal W.`, `Petal L.`, and `Petal W.`, and the third the species.

### Source

Fisher, R. A. (1936) The use of multiple measurements in taxonomi

In [5]:
# Using pydataset

df_iris = data('iris')
df_iris.head(1)

# Does pydataset not have the range column like seaborn does?
# Also, does capitalizing column names affect anything?

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa


In [6]:
# Using seaborn -- love the column names.

df_iris = sns.load_dataset('iris')
df_iris.head(1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa


### Print the first 3 rows.

In [7]:
df_iris.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [8]:
df_iris.iloc[0:3]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [9]:
df_iris.shape

(150, 5)

--------------------------

### Print the column names.

In [10]:
df_iris.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [11]:
# Return a plain list of columns in case I want to grab and use them later.

df_iris.columns.to_list()


['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

In [12]:
for column in df_iris.columns:
    print(column)

sepal_length
sepal_width
petal_length
petal_width
species


### Print the data type of each column.

In [13]:
# Return just data types.

df_iris.dtypes # For a single column (a Series), the attribute is just 'dtype'

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [14]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [15]:
# This method returns the summary statistics for the numeric variables in my df.

stats = df_iris.describe().T
stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sepal_length,150.0,5.843333,0.828066,4.3,5.1,5.8,6.4,7.9
sepal_width,150.0,3.057333,0.435866,2.0,2.8,3.0,3.3,4.4
petal_length,150.0,3.758,1.765298,1.0,1.6,4.35,5.1,6.9
petal_width,150.0,1.199333,0.762238,0.1,0.3,1.3,1.8,2.5


In [16]:
# I can calculate a range for each numeric variable and select certain columns of interest.

stats['range'] = stats['max'] - stats['min']
stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,range
sepal_length,150.0,5.843333,0.828066,4.3,5.1,5.8,6.4,7.9,3.6
sepal_width,150.0,3.057333,0.435866,2.0,2.8,3.0,3.3,4.4,2.4
petal_length,150.0,3.758,1.765298,1.0,1.6,4.35,5.1,6.9,5.9
petal_width,150.0,1.199333,0.762238,0.1,0.3,1.3,1.8,2.5,2.4


In [17]:
stats[['mean', '50%', 'std']]
# The inner brackets hold a list of column names; the outer brackets index the DataFrame with that list.

Unnamed: 0,mean,50%,std
sepal_length,5.843333,5.8,0.828066
sepal_width,3.057333,3.0,0.435866
petal_length,3.758,4.35,1.765298
petal_width,1.199333,1.3,0.762238


In [18]:
subset_of_columns = ['mean', '50%', 'std']
stats[subset_of_columns]

Unnamed: 0,mean,50%,std
sepal_length,5.843333,5.8,0.828066
sepal_width,3.057333,3.0,0.435866
petal_length,3.758,4.35,1.765298
petal_width,1.199333,1.3,0.762238


5. Read the Table1_CustDetails table from your spreadsheet exercises google sheet into a dataframe named df_google_sheets.

Make sure that the spreadsheet is publicly visible under your sharing settings.
- assign the first 100 rows to a new dataframe, df_google_sheets_sample
- print the number of rows of your original dataframe
- print the first 5 column names
- print the column names that have a data type of object
- compute the range for each of the numeric variables.

## Create `df_google`
- Read the data from a Google sheet into a dataframe, df_google.

In [19]:
sheet_url = 'https://docs.google.com/spreadsheets/d/1kcrY0Q2IGFaEg0OgWxJORGCC0tNjH-L42Z0Q-4ajIUY/edit#gid=1023018493'
# Grabbed the Sheets URL.

In [20]:
csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
# Turns the Sheets address into a CSV export URL.
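Because the same URL rewrite is needed for every sheet, it can be wrapped in a small helper function. The following is only a sketch: the name `sheet_to_csv_url` and the sample URL are hypothetical, and it assumes the sheet address ends in `/edit#gid=<tab id>` like the one above.

```python
def sheet_to_csv_url(sheet_url):
    """Turn a Google Sheets edit URL into its CSV export URL."""
    marker = '/edit#gid='
    if marker not in sheet_url:
        # Fail loudly rather than handing pandas a URL that is not a CSV export.
        raise ValueError(f'expected {marker!r} in the sheet URL')
    return sheet_url.replace(marker, '/export?format=csv&gid=')


# Hypothetical sheet id and tab id, for illustration only.
example_url = 'https://docs.google.com/spreadsheets/d/SHEET_ID/edit#gid=0'
print(sheet_to_csv_url(example_url))
```

The returned string can be passed straight to `pd.read_csv()`, provided the sheet is publicly visible.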

In [21]:
df_google = pd.read_csv(csv_export_url)
df_google
# Uses the pandas `pd.read_csv()` function to read the data.

Unnamed: 0,customer_id,gender,is_senior_citizen,partner,dependents,phone_service,internet_service,contract_type,payment_type,monthly_charges,total_charges,churn,tenure
0,0002-ORFBO,Female,0,Yes,Yes,1,1,1,Mailed check,65.60,593.30,No,9.0
1,0003-MKNFE,Male,0,No,No,2,1,0,Mailed check,59.90,542.40,No,9.1
2,0004-TLHLJ,Male,0,No,No,1,2,0,Electronic check,73.90,280.85,Yes,3.8
3,0011-IGKFF,Male,1,Yes,No,1,2,0,Electronic check,98.00,1237.85,Yes,12.6
4,0013-EXCHZ,Female,1,Yes,No,1,2,0,Mailed check,83.90,267.40,Yes,3.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7044,9987-LUTYD,Female,0,No,No,1,1,1,Mailed check,55.15,742.90,No,13.5
7045,9992-RRAMN,Male,0,Yes,No,2,2,0,Electronic check,85.10,1873.70,Yes,22.0
7046,9992-UJOEL,Male,0,No,No,1,1,0,Mailed check,50.30,92.75,No,1.8
7047,9993-LHIEB,Male,0,Yes,Yes,1,1,2,Mailed check,67.85,4627.65,No,68.2


In [22]:
# Print the first 3 rows.

df_google.head(3)

Unnamed: 0,customer_id,gender,is_senior_citizen,partner,dependents,phone_service,internet_service,contract_type,payment_type,monthly_charges,total_charges,churn,tenure
0,0002-ORFBO,Female,0,Yes,Yes,1,1,1,Mailed check,65.6,593.3,No,9.0
1,0003-MKNFE,Male,0,No,No,2,1,0,Mailed check,59.9,542.4,No,9.1
2,0004-TLHLJ,Male,0,No,No,1,2,0,Electronic check,73.9,280.85,Yes,3.8


In [23]:
# Print the number of rows and columns.
df_google.shape

(7049, 13)

In [24]:
# Print the column names.
df_google.columns.to_list()

['customer_id',
 'gender',
 'is_senior_citizen',
 'partner',
 'dependents',
 'phone_service',
 'internet_service',
 'contract_type',
 'payment_type',
 'monthly_charges',
 'total_charges',
 'churn',
 'tenure']

In [25]:
# Print the data type of each column.
df_google.dtypes

customer_id           object
gender                object
is_senior_citizen      int64
partner               object
dependents            object
phone_service          int64
internet_service       int64
contract_type          int64
payment_type          object
monthly_charges      float64
total_charges        float64
churn                 object
tenure               float64
dtype: object

In [26]:
# Print the summary statistics for each of the numeric variables.
df_google.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
is_senior_citizen,7049.0,0.162009,0.368485,0.0,0.0,0.0,0.0,1.0
phone_service,7049.0,1.324585,0.642709,0.0,1.0,1.0,2.0,2.0
internet_service,7049.0,1.222585,0.779068,0.0,1.0,1.0,2.0,2.0
contract_type,7049.0,0.690878,0.833757,0.0,0.0,0.0,1.0,2.0
monthly_charges,7049.0,64.747014,30.09946,18.25,35.45,70.35,89.85,118.75
total_charges,7038.0,2283.043883,2266.521984,18.8,401.5875,1397.1,3793.775,8684.8
tenure,7049.0,32.380068,24.594926,0.0,8.7,28.7,55.2,79.3


## Print the unique values for each of your categorical variables.

In [27]:
for col in df_google.columns:
    if df_google[col].dtypes == 'object':
        print(f'{col} has {df_google[col].nunique()} unique values.')

customer_id has 7043 unique values.
gender has 2 unique values.
partner has 2 unique values.
dependents has 2 unique values.
payment_type has 4 unique values.
churn has 2 unique values.


In [28]:
for col in df_google.columns:
    if df_google[col].dtypes == 'object':
        print(df_google[col].value_counts())

0048-LUMLS    2
0042-RLHYP    2
0042-JVWOJ    2
0040-HALCW    2
0036-IHMOT    2
             ..
3363-EWLGO    1
3363-DTIVD    1
3359-DSRKA    1
3354-OADJP    1
9995-HOTOH    1
Name: customer_id, Length: 7043, dtype: int64
Male      3558
Female    3491
Name: gender, dtype: int64
No     3642
Yes    3407
Name: partner, dtype: int64
No     4934
Yes    2115
Name: dependents, dtype: int64
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1548
Credit card (automatic)      1524
Name: payment_type, dtype: int64
No     5179
Yes    1870
Name: churn, dtype: int64
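The two loops above can also be expressed without an explicit dtype check by selecting the non-numeric columns first. This is a small sketch on a made-up stand-in for `df_google`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df_google; the values are made up for illustration.
df = pd.DataFrame({
    'gender': ['Female', 'Male', 'Male'],
    'churn': ['No', 'No', 'Yes'],
    'monthly_charges': [65.6, 59.9, 73.9],
})

# Keep only the non-numeric columns, then count distinct values per column.
categorical = df.select_dtypes(exclude='number')
unique_counts = categorical.nunique()
print(unique_counts)

# value_counts() on one column gives the frequency of each category.
print(df['churn'].value_counts())
```

Note that an identifier such as `customer_id` is also non-numeric, so it is worth dropping it before treating the result as a list of categorical variables.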


6. Download your spreadsheet exercises google sheet as an excel file (File → Download → Microsoft Excel). Read the Table1_CustDetails worksheet into a dataframe named df_excel.
- assign the first 100 rows to a new dataframe, df_excel_sample
- print the number of rows of your original dataframe
- print the first 5 column names
- print the column names that have a data type of object
- compute the range for each of the numeric variables.

## 6. Create `df_excel`
- Read the `Table1_CustDetails` worksheet from `Excel_Exercises.xlsx` (`sheet_name='Table1_CustDetails'`).

In [29]:
help(pd.read_excel)

Help on function read_excel in module pandas.io.excel._base:

read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype: 'DtypeArg | None' = None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=None, mangle_dupe_cols=True, storage_options: 'StorageOptions' = None)
    Read an Excel file into a pandas DataFrame.
    
    Supports `xls`, `xlsx`, `xlsm`, `xlsb`, `odf`, `ods` and `odt` file extensions
    read from a local filesystem or URL. Supports an option to read
    a single sheet or a list of sheets.
    
    Parameters
    ----------
    io : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and 

In [30]:
df_excel = pd.read_excel('Jason Turner - jemison_spreadsheet_exercises.xlsx', sheet_name='Table1_CustDetails')

In [31]:
# Assign the first 100 rows to a new dataframe, `df_excel_sample`.
# df_excel.iloc[0:100] would select the same rows.
df_excel_sample = df_excel.head(100)
df_excel_sample.shape

(100, 13)

In [32]:
# Print the number of rows of your original dataframe.
df_excel.shape[0]

7049

In [33]:
# Print the first 5 column names.
df_excel.columns[:5]

Index(['customer_id', 'gender', 'is_senior_citizen', 'partner', 'dependents'], dtype='object')

In [34]:
# Print the column names that have a data type of object.
df_excel.select_dtypes(include='object').head()

Unnamed: 0,customer_id,gender,partner,dependents,payment_type,churn
0,0002-ORFBO,Female,Yes,Yes,Mailed check,No
1,0003-MKNFE,Male,No,No,Mailed check,No
2,0004-TLHLJ,Male,No,No,Electronic check,Yes
3,0011-IGKFF,Male,Yes,No,Electronic check,Yes
4,0013-EXCHZ,Female,Yes,No,Mailed check,Yes


In [35]:
df_excel.select_dtypes(include='object').columns.tolist()

['customer_id', 'gender', 'partner', 'dependents', 'payment_type', 'churn']

In [36]:
df_excel.select_dtypes(include=['object', 'int64']).head()
# You can pass a list of the data types that you want to include or exclude.

Unnamed: 0,customer_id,gender,partner,dependents,payment_type,churn
0,0002-ORFBO,Female,Yes,Yes,Mailed check,No
1,0003-MKNFE,Male,No,No,Mailed check,No
2,0004-TLHLJ,Male,No,No,Electronic check,Yes
3,0011-IGKFF,Male,Yes,No,Electronic check,Yes
4,0013-EXCHZ,Female,Yes,No,Mailed check,Yes


In [37]:
# What if we want to exclude floats?

df_excel.select_dtypes(exclude=['float64']).head()

Unnamed: 0,customer_id,gender,partner,dependents,payment_type,churn
0,0002-ORFBO,Female,Yes,Yes,Mailed check,No
1,0003-MKNFE,Male,No,No,Mailed check,No
2,0004-TLHLJ,Male,No,No,Electronic check,Yes
3,0011-IGKFF,Male,Yes,No,Electronic check,Yes
4,0013-EXCHZ,Female,Yes,No,Mailed check,Yes


### Compute the range for each of the numeric variables.

In [38]:
# Some of these numeric columns are more like encoded categorical variables.

df_excel.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
is_senior_citizen,7049.0,0.162009,0.368485,0.0,0.0,0.0,0.0,1.0
phone_service,7049.0,1.324585,0.642709,0.0,1.0,1.0,2.0,2.0
internet_service,7049.0,1.222585,0.779068,0.0,1.0,1.0,2.0,2.0
contract_type,7049.0,0.690878,0.833757,0.0,0.0,0.0,1.0,2.0
monthly_charges,7049.0,64.747014,30.09946,18.25,35.45,70.35,89.85,118.75
total_charges,7038.0,2283.043883,2266.521984,18.8,401.5875,1397.1,3793.775,8684.8
tenure,7049.0,32.379866,24.595524,0.0,8.733456,28.683425,55.229399,79.341772


In [39]:
# I can select just the true numeric variables to declutter my results.

telco_stats = df_excel[['monthly_charges', 'total_charges']].describe().T
telco_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
monthly_charges,7049.0,64.747014,30.09946,18.25,35.45,70.35,89.85,118.75
total_charges,7038.0,2283.043883,2266.521984,18.8,401.5875,1397.1,3793.775,8684.8


In [40]:
telco_stats['range'] = telco_stats['max'] - telco_stats['min']
telco_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,range
monthly_charges,7049.0,64.747014,30.09946,18.25,35.45,70.35,89.85,118.75,100.5
total_charges,7038.0,2283.043883,2266.521984,18.8,401.5875,1397.1,3793.775,8684.8,8666.0
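The range can also be computed directly from the DataFrame, without going through `describe()`. This sketch uses a hypothetical helper, `numeric_range`, on a toy frame whose minimum and maximum values match the `min` and `max` columns above.

```python
import pandas as pd


def numeric_range(df):
    """Return max - min for every numeric column of df as a Series."""
    numeric = df.select_dtypes(include='number')
    # max() and min() skip missing values by default.
    return numeric.max() - numeric.min()


# Toy frame; min and max of each numeric column match the table above.
toy = pd.DataFrame({
    'monthly_charges': [18.25, 70.35, 118.75],
    'total_charges': [18.8, 1397.1, 8684.8],
    'churn': ['No', 'Yes', 'No'],
})
ranges = numeric_range(toy)
print(ranges)
```

Encoded categorical columns such as `contract_type` are numeric too, so select the true numeric columns first if their ranges would only add clutter.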


7. Read the data from this google sheet into a dataframe, df_google.
- print the first 3 rows
- print the number of rows and columns
- print the column names
- print the data type of each column
- print the summary statistics for each of the numeric variables
- print the unique values for each of your categorical variables

In [41]:
sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'
csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
df_google = pd.read_csv(csv_export_url)
print(df_google.head(3))
print(df_google.shape)
print(df_google.columns.to_list())
df_google.info()  # info() prints its own summary, so call it rather than printing the method
print(df_google.describe())
for col in df_google.columns:
    if df_google[col].dtypes == 'object':
        print(f'{col} has {df_google[col].nunique()} unique values.')

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
(891, 12)
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Make a new Python module, `acquire.py`.

**Make sure your `env.py` and csv files are *not* being pushed to GitHub!**

## Exercise 1 for `acquire.py`
Make a function named `get_titanic_data` that returns the titanic data from the Codeup Data Science Database as a pandas DataFrame.
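The contents of `acquire.py` are not shown in this notebook, so the following is only a sketch of one way such a function could be built. The helper names `get_connection` and `query_titanic`, the `mysql+pymysql` driver, the database name `titanic_db`, and the table name `passengers` are all assumptions; reading from a URL string also requires SQLAlchemy and the driver to be installed.

```python
import pandas as pd


def get_connection(db, user, host, password):
    """Build a SQLAlchemy-style connection URL for a MySQL database."""
    # Assumes the pymysql driver; swap the prefix for a different driver.
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'


def query_titanic(user, host, password):
    """Query the titanic table and return it as a DataFrame."""
    # 'titanic_db' and 'passengers' are assumed names; adjust to the real schema.
    url = get_connection('titanic_db', user, host, password)
    return pd.read_sql('SELECT * FROM passengers', url)


# Placeholder credentials, for illustration only.
print(get_connection('titanic_db', 'some_user', 'some_host', 'some_password'))
```

In the notebook's setup the credentials would come from `env.py` (`from env import host, user, password`) rather than being passed in by hand.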

In [42]:
import acquire

In [43]:
titanic_df = acquire.get_titanic_data()
titanic_df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


## Exercise 2 for `acquire.py`

Make a function named `get_iris_data` that returns the data from the `iris_db` on the Codeup Data Science Database as a pandas DataFrame. The returned DataFrame should include the actual name of the species in addition to the `species_id`s.

In [44]:
iris_df = acquire.get_iris_data()
iris_df.head()

Unnamed: 0,species_id,species_name,sepal_length,sepal_width,petal_length,petal_width
0,1,setosa,5.1,3.5,1.4,0.2
1,1,setosa,4.9,3.0,1.4,0.2
2,1,setosa,4.7,3.2,1.3,0.2
3,1,setosa,4.6,3.1,1.5,0.2
4,1,setosa,5.0,3.6,1.4,0.2


## Exercise 3 for `acquire.py`

Make a function named `get_telco_data` that returns the data from the `telco_churn` database on the Codeup Data Science Database. In your SQL, be sure to join all 4 tables together, so that the resulting DataFrame contains all the contract, payment, and internet service options.
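The shape of a four-table join can be tried out locally before running it against the real database. This sketch builds tiny in-memory SQLite tables; the table names (`customers`, `contract_types`, `internet_service_types`, `payment_types`) are assumptions, while the id columns and the two sample customers follow the `telco_df` output shown in this notebook.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the telco database; table names are assumed.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE contract_types (contract_type_id INTEGER, contract_type TEXT);
    CREATE TABLE internet_service_types (internet_service_type_id INTEGER, internet_service_type TEXT);
    CREATE TABLE payment_types (payment_type_id INTEGER, payment_type TEXT);
    CREATE TABLE customers (customer_id TEXT, contract_type_id INTEGER,
                            internet_service_type_id INTEGER, payment_type_id INTEGER);
    INSERT INTO contract_types VALUES (1, 'Month-to-month'), (2, 'One year');
    INSERT INTO internet_service_types VALUES (1, 'DSL'), (2, 'Fiber optic');
    INSERT INTO payment_types VALUES (1, 'Electronic check'), (2, 'Mailed check');
    INSERT INTO customers VALUES ('0002-ORFBO', 2, 1, 2), ('0004-TLHLJ', 1, 2, 1);
""")

# Join the customer rows to each lookup table on its shared id column.
query = """
    SELECT *
    FROM customers
    JOIN contract_types USING (contract_type_id)
    JOIN internet_service_types USING (internet_service_type_id)
    JOIN payment_types USING (payment_type_id)
"""
telco_toy = pd.read_sql(query, conn)
print(telco_toy)
```

Each `JOIN ... USING` adds the human-readable label next to its id, which is why the real `telco_df` carries both `contract_type_id` and `contract_type`.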

In [45]:
telco_df = acquire.get_telco_data()
telco_df.head()

Unnamed: 0,payment_type_id,internet_service_type_id,contract_type_id,customer_id,gender,senior_citizen,partner,dependents,tenure,phone_service,...,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,2,1,2,0002-ORFBO,Female,0,Yes,Yes,9,Yes,...,Yes,Yes,No,Yes,65.6,593.3,No,One year,DSL,Mailed check
1,2,1,1,0003-MKNFE,Male,0,No,No,9,Yes,...,No,No,Yes,No,59.9,542.4,No,Month-to-month,DSL,Mailed check
2,1,2,1,0004-TLHLJ,Male,0,No,No,4,Yes,...,No,No,No,Yes,73.9,280.85,Yes,Month-to-month,Fiber optic,Electronic check
3,1,2,1,0011-IGKFF,Male,1,Yes,No,13,Yes,...,No,Yes,Yes,Yes,98.0,1237.85,Yes,Month-to-month,Fiber optic,Electronic check
4,2,2,1,0013-EXCHZ,Female,1,Yes,No,3,Yes,...,Yes,Yes,No,Yes,83.9,267.4,Yes,Month-to-month,Fiber optic,Mailed check


### Add Caching to the `acquire.py` functions

Once your `get_titanic_data`, `get_iris_data`, and `get_telco_data` functions are written, add caching to them. Edit the beginning of each function to check for a local file named `titanic.csv`, `iris.csv`, or `telco.csv`. If the file exists, read the data from it. If it doesn't, run the SQL and pandas code needed to create the dataframe, then write the dataframe to a `.csv` file with the appropriate name.

In [None]:
# Caching pattern for the body of each get_*_data function.
# new_titanic_data, new_iris_data, and new_telco_data read fresh data from the db.

# if os.path.isfile('titanic_df.csv'):
#     # If csv file exists, read in data from csv file.
#     df = pd.read_csv('titanic_df.csv', index_col=0)
# else:
#     # Read fresh data from db into a DataFrame.
#     df = new_titanic_data()
#     # Write DataFrame to a csv file.
#     df.to_csv('titanic_df.csv')

# if os.path.isfile('iris_df.csv'):
#     # If csv file exists, read in data from csv file.
#     df = pd.read_csv('iris_df.csv', index_col=0)
# else:
#     # Read fresh data from db into a DataFrame.
#     df = new_iris_data()
#     # Cache data
#     df.to_csv('iris_df.csv')

# if os.path.isfile('telco.csv'):
#     # If csv file exists, read in data from csv file.
#     df = pd.read_csv('telco.csv', index_col=0)
# else:
#     # Read fresh data from db into a DataFrame.
#     df = new_telco_data()
#     # Cache data
#     df.to_csv('telco.csv')
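Since the three caching checks differ only in the file name and the query function, they can share one helper. The sketch below uses a hypothetical function, `cached_read`, and a fake query function in place of a real database call, so it runs without credentials.

```python
import os
import tempfile

import pandas as pd


def cached_read(filename, fetch):
    """Return the DataFrame stored in filename if it exists; otherwise
    call fetch(), write the result to filename, and return it."""
    if os.path.isfile(filename):
        return pd.read_csv(filename, index_col=0)
    df = fetch()
    df.to_csv(filename)
    return df


calls = []


def fake_query():
    """Stand-in for a database query; records each time it is called."""
    calls.append(1)
    return pd.DataFrame({'species': ['setosa', 'virginica'],
                         'sepal_length': [5.1, 6.3]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'iris.csv')
    first = cached_read(path, fake_query)   # no file yet: runs the query, writes the csv
    second = cached_read(path, fake_query)  # file exists: reads the csv instead

print(len(calls))
```

The query function runs only on the first call. Deleting the `.csv` file is the way to force a fresh read from the database.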

In [6]:
import pandas as pd
import numpy as np
import os

# visualize
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('figure', figsize=(8, 6))
plt.rc('font', size=13)

# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

# acquire
from env import host, user, password

# To access pydataset data table use:
from pydataset import data


In [10]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [2]:
import acquire
iris_df = acquire.get_iris_data()
iris_df.head()

Unnamed: 0,species_id,species_name,sepal_length,sepal_width,petal_length,petal_width
0,1,setosa,5.1,3.5,1.4,0.2
1,1,setosa,4.9,3.0,1.4,0.2
2,1,setosa,4.7,3.2,1.3,0.2
3,1,setosa,4.6,3.1,1.5,0.2
4,1,setosa,5.0,3.6,1.4,0.2


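In [None]:
def prep_iris_data(df):
    '''
    Take in the iris DataFrame and return it with dummy columns for species appended.
    Assumes the species column is named 'species'.
    '''
    # One-hot encode the species column, keeping every category.
    dummy_df = pd.get_dummies(df['species'], drop_first=False)

    # Append the dummy columns to the original DataFrame.
    df = pd.concat([df, dummy_df], axis=1)

    return df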

In [None]:
def split_iris_data(df):
    '''
    Take in a DataFrame and return train, validate, and test DataFrames,
    stratified on species.
    '''
    # Hold out 20% of the rows as the test set.
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.species)

    # Split the remainder into train and validate (default test_size of .25).
    train, validate = train_test_split(train_validate,
                                       random_state=123,
                                       stratify=train_validate.species)

    return train, validate, test

# Usage: train, validate, test = split_iris_data(df)

In [None]:
def clean_titanic_data(df):
    '''
    This function will clean the data prior to splitting.
    '''
    # Drops any duplicate rows
    df = df.drop_duplicates()
    
    # Drops columns that are already represented by other columns
    cols_to_drop = ['deck', 'embarked', 'class']
    df = df.drop(columns=cols_to_drop)
    
    # Fills the small number of null values for embark_town with the mode
    df['embark_town'] = df.embark_town.fillna(value='Southampton')
    
    # Uses one-hot encoding to create dummies of string columns for future modeling
    dummy_df = pd.get_dummies(df[['sex', 'embark_town']], dummy_na=False, drop_first=True)
    df = pd.concat([df, dummy_df], axis=1)
    
    return df

In [None]:
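def split_titanic_data(df):
    '''
    Take in a DataFrame and return train, validate, and test DataFrames,
    stratified on survived.
    '''
    # Hold out 20% of the rows as the test set.
    train, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.survived)
    
    # Split the remaining rows 70/30 into train and validate.
    train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)
    
    return train, validate, test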
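The iris and titanic split functions differ only in the column they stratify on, so the pattern can be factored into one helper. This is a sketch with a hypothetical function, `split_data`, run on made-up data; it uses the same proportions and seed as `split_titanic_data`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df, target, seed=123):
    """Split df into train, validate, and test, stratifying on target."""
    # Hold out 20% of the rows as the test set.
    train_validate, test = train_test_split(
        df, test_size=.2, random_state=seed, stratify=df[target])
    # Split the remaining rows 70/30 into train and validate.
    train, validate = train_test_split(
        train_validate, test_size=.3, random_state=seed,
        stratify=train_validate[target])
    return train, validate, test


# Toy data: 100 rows with a balanced binary target.
toy = pd.DataFrame({'x': range(100), 'survived': [0, 1] * 50})
train, validate, test = split_data(toy, 'survived')
print(train.shape, validate.shape, test.shape)
```

Stratifying keeps the share of each target class roughly equal across the three splits, which matters most when one class is rare.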
    

In [None]:
def impute_titanic_mode(train, validate, test):
    pass  # not implemented yet

In [None]:
def impute_mean_age(train, validate, test):
    pass  # not implemented yet

In [None]:
def prep_titanic_data(df):
    pass  # not implemented yet
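The imputation functions above are left unwritten in this notebook. As a sketch of the general approach only, and not the author's implementation, the hypothetical helper below fits a `SimpleImputer` on the train split and applies it to all three splits; the toy ages are made up.

```python
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_column(train, validate, test, column, strategy):
    """Fit a SimpleImputer on train[column] and fill that column in all three splits."""
    imputer = SimpleImputer(strategy=strategy)
    # Work on copies so the caller's DataFrames are left unchanged.
    train, validate, test = train.copy(), validate.copy(), test.copy()
    # Learn the fill value from train only, so nothing leaks from validate or test.
    train[[column]] = imputer.fit_transform(train[[column]])
    validate[[column]] = imputer.transform(validate[[column]])
    test[[column]] = imputer.transform(test[[column]])
    return train, validate, test


nan = float('nan')
train = pd.DataFrame({'age': [20.0, 30.0, nan]})
validate = pd.DataFrame({'age': [nan]})
test = pd.DataFrame({'age': [40.0, nan]})

train_i, validate_i, test_i = impute_column(train, validate, test, 'age', 'mean')
print(train_i, validate_i, test_i, sep='\n')
```

The missing ages in every split are filled with 25.0, the mean of the observed train values. Passing `strategy='most_frequent'` would fill with the mode instead, which is the usual choice for a categorical column such as `embark_town`.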

In [None]:
def prep_telco_data(df):
    # Drop the duplicate id columns and the customer identifier
    df = df.drop(columns=['payment_type_id', 'internet_service_type_id', 'contract_type_id', 'customer_id'])
    
    # Drop null values stored as whitespace
    df['total_charges'] = df['total_charges'].str.strip()
    df = df[df.total_charges != ''].copy()
    
    # Convert to correct datatype
    df['total_charges'] = df.total_charges.astype(float)
    
    # Convert binary categorical variables to numeric, similar to one-hot encoding
    df['gender_encoded'] = df.gender.map({'Female': 1, 'Male': 0})
    df['partner_encoded'] = df.partner.map({'Yes': 1, 'No': 0})
    df['dependents_encoded'] = df.dependents.map({'Yes': 1, 'No': 0})
    df['phone_service_encoded'] = df.phone_service.map({'Yes': 1, 'No': 0})
    df['churn_encoded'] = df.churn.map({'Yes': 1, 'No': 0})
    
    # Get dummies for non-binary categorical variables
    dummy_df = pd.get_dummies(df[['multiple_lines',
                                  'online_security',
                                  'online_backup',
                                  'tech_support',
                                  'streaming_tv',
                                  'streaming_movies',
                                  'contract_type',
                                  'internet_service_type',
                                  'payment_type']], dummy_na=False,
                              drop_first=True)
    
    # Concatenate dummy dataframe to original
    df = pd.concat([df, dummy_df], axis=1)
    
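    # Split the data, stratifying on churn (same proportions as split_titanic_data)
    train_validate, test = train_test_split(df, test_size=.2, random_state=123, stratify=df.churn)
    train, validate = train_test_split(train_validate, test_size=.3, random_state=123, stratify=train_validate.churn)
    
    return train, validate, test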