In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import wrangle

## Acquire customer_id, monthly_charges, tenure, and total_charges from telco_churn database for all customers with a 2 year contract.

### Read in df from SQL

>Below is what I start adding to my wrangle.py file.


- I need the imports to run my functions in my .py file.


- First, I go to 'Pancakes' (my MySQL client) to write a SQL query that pulls what I need.


- I create a function that builds the connection URL for the database; my next function needs that URL.

`get_db_url(db_name)`


- I then create a function that passes the query I wrote in MySQL (Pancakes) to the pandas function `pd.read_sql(query, url)`, which returns a df to me.

`get_data_from_sql()`

>Here's what the above looks like in my wrangle.py file so far. By calling `get_data_from_sql()`, I have my basic df ready to prep.

`import pandas as pd`
`import numpy as np`

`from env import host, user, password`

`def get_db_url(db_name):
    return f"mysql+pymysql://{user}:{password}@{host}/{db_name}"`


`def get_data_from_sql():
    query = """
    SELECT customer_id, monthly_charges, tenure, total_charges
    FROM customers
    WHERE contract_type_id = 3;
    """
    df = pd.read_sql(query, get_db_url('telco_churn'))
    return df`
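
If the database is not reachable, the query itself can still be tried out locally. Below is a minimal sketch, assuming an in-memory SQLite table as a stand-in for `customers`; the rows and the `sample_df` name are made up for illustration, and only the query text comes from the function above.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the telco_churn `customers` table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id TEXT,
        monthly_charges REAL,
        tenure INTEGER,
        total_charges TEXT,
        contract_type_id INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        ("A-001", 20.0, 0, " ", 3),
        ("A-002", 50.0, 10, "500.0", 3),
        ("A-003", 70.0, 5, "350.0", 1),
    ],
)
conn.commit()

# Same query text as in get_data_from_sql().
query = """
SELECT customer_id, monthly_charges, tenure, total_charges
FROM customers
WHERE contract_type_id = 3;
"""

# pd.read_sql accepts a sqlite3 connection as well as a SQLAlchemy URL.
sample_df = pd.read_sql(query, conn)
conn.close()
print(sample_df)
```

Only the rows with `contract_type_id = 3` come back, and only the four selected columns.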

-**<font color=purple>Once you have a module that contains the functions above, it's easy to bring in your df to explore for cleaning and prep. Look how clean the single call below is! All the messy workings, the hard work, are behind the scenes!</font>**

In [2]:
df = wrangle.get_data_from_sql()

### It works! Add this to my wrangle.py file

## Walk through the steps above using your new dataframe. You may handle the missing values however you feel is appropriate.

### First Look at df

In [3]:
df.head()

Unnamed: 0,customer_id,monthly_charges,tenure,total_charges
0,0013-SMEOE,109.7,71,7904.25
1,0014-BMAQU,84.65,63,5377.8
2,0016-QLJIS,90.45,65,5957.9
3,0017-DINOC,45.2,54,2460.55
4,0017-IUDMW,116.8,72,8456.75


In [4]:
print(f'My df has {df.shape[0]} rows and {df.shape[1]} columns.')

My df has 1695 rows and 4 columns.


In [5]:
df.customer_id.nunique()

1695

### It looks like customer_id is a unique identifier for customers.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1695 non-null object
monthly_charges    1695 non-null float64
tenure             1695 non-null int64
total_charges      1695 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 53.1+ KB


### total_charges is an object... find out why

In [7]:
df.total_charges.value_counts(dropna=False)

           10
3533.6      2
1110.05     2
5682.25     2
7334.05     2
           ..
20.45       1
2799.75     1
3632        1
3389.25     1
6586.85     1
Name: total_charges, Length: 1678, dtype: int64

### total_charges has 10 values that are either spaces or blanks. Once I figure out which, I can decide how to handle the 'missing' values.

In [8]:
df[df.total_charges == ' ']

Unnamed: 0,customer_id,monthly_charges,tenure,total_charges
234,1371-DWPAZ,56.05,0,
416,2520-SGTTA,20.0,0,
453,2775-SEFEE,61.9,0,
505,3115-CZMZD,20.25,0,
524,3213-VVOLG,25.35,0,
678,4075-WKNIU,73.35,0,
716,4367-NUYAO,25.75,0,
726,4472-LVYGI,52.55,0,
941,5709-LVOEQ,80.85,0,
1293,7644-OMVMY,19.85,0,
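
The comparison with `' '` only catches entries that are exactly one space. Here is a small sketch, on made-up values, of one way to flag empty strings and longer runs of whitespace too; the `charges` and `is_blank` names are just for illustration.

```python
import pandas as pd

# Toy column (made-up values) mixing numbers stored as text with blank-like entries.
charges = pd.Series(["20.35", " ", "", "   ", "7904.25"], name="total_charges")

# True where nothing is left once surrounding whitespace is stripped.
is_blank = charges.str.strip() == ""

# How many blank-like entries there are, and what they look like exactly.
print(is_blank.sum())
print(charges[is_blank].map(repr).tolist())
```

Printing with `repr` makes the difference between `''`, `' '`, and `'   '` visible, which a normal dataframe display hides.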


### I could simply drop those rows...

In [9]:
# Filter out the blank rows; .copy() makes df2 independent of df

df2 = df[df.total_charges != ' '].copy()

In [10]:
# Validate that total_charges all have values

df2[df2.total_charges == ' ']

Unnamed: 0,customer_id,monthly_charges,tenure,total_charges


In [29]:
# More validating...

df2.total_charges.value_counts(dropna=True).sort_index()

20.35      1
20.45      1
52.00      1
68.80      1
76.65      1
          ..
8547.15    1
8564.75    1
8594.40    1
8670.10    1
8672.45    1
Name: total_charges, Length: 1677, dtype: int64

In [12]:
df2.total_charges = df2.total_charges.astype(float)

In [13]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1685 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1685 non-null object
monthly_charges    1685 non-null float64
tenure             1685 non-null int64
total_charges      1685 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 65.8+ KB
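
Another way to get to the same place is to convert first and drop afterwards. This is a sketch on a made-up frame, not what I ran above: `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN, and `dropna` then removes those rows.

```python
import pandas as pd

# Made-up rows standing in for the real df.
toy = pd.DataFrame(
    {
        "customer_id": ["A-001", "A-002", "A-003"],
        "monthly_charges": [20.0, 50.0, 70.0],
        "tenure": [0, 10, 5],
        "total_charges": [" ", "500.0", "350.0"],
    }
)

# errors="coerce" turns unparseable entries (including " ") into NaN.
toy["total_charges"] = pd.to_numeric(toy["total_charges"], errors="coerce")

# Drop the rows that failed to parse; .copy() gives an independent frame.
dropped = toy.dropna(subset=["total_charges"]).copy()
print(dropped.dtypes)
```

The trade-off is that coercion hides every kind of bad value, not only blanks, so it is worth counting the NaNs before dropping them.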


### BUT... It looks like those values are blank because the tenure is 0. I will change the tenure to 1, since they were probably customers for about a month.

In [14]:
df.tenure.value_counts().sort_index()

0      10
1       2
2       1
3       3
4       4
     ... 
68     65
69     66
70     88
71    137
72    343
Name: tenure, Length: 73, dtype: int64

In [15]:
# Replace any tenures of 0 with 1 (assign back instead of using inplace on a column)

df['tenure'] = df.tenure.replace(0, 1)

In [16]:
# Validate my tenure count for value 1

df.tenure.value_counts().sort_index()

1      12
2       1
3       3
4       4
5       1
     ... 
68     65
69     66
70     88
71    137
72    343
Name: tenure, Length: 72, dtype: int64

In [17]:
df[df.tenure == 1]

Unnamed: 0,customer_id,monthly_charges,tenure,total_charges
188,1099-GODLO,20.35,1,20.35
234,1371-DWPAZ,56.05,1,
416,2520-SGTTA,20.0,1,
453,2775-SEFEE,61.9,1,
505,3115-CZMZD,20.25,1,
524,3213-VVOLG,25.35,1,
678,4075-WKNIU,73.35,1,
716,4367-NUYAO,25.75,1,
726,4472-LVYGI,52.55,1,
941,5709-LVOEQ,80.85,1,


In [18]:
# Replace the blank total_charges with that row's monthly_charges

df['total_charges'] = df.total_charges.mask(df.total_charges == ' ', df.monthly_charges)

In [19]:
# Validate my changes

df[df.tenure == 1]

Unnamed: 0,customer_id,monthly_charges,tenure,total_charges
188,1099-GODLO,20.35,1,20.35
234,1371-DWPAZ,56.05,1,56.05
416,2520-SGTTA,20.0,1,20.0
453,2775-SEFEE,61.9,1,61.9
505,3115-CZMZD,20.25,1,20.25
524,3213-VVOLG,25.35,1,25.35
678,4075-WKNIU,73.35,1,73.35
716,4367-NUYAO,25.75,1,25.75
726,4472-LVYGI,52.55,1,52.55
941,5709-LVOEQ,80.85,1,80.85


In [20]:
df.total_charges = df.total_charges.astype(float)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1695 non-null object
monthly_charges    1695 non-null float64
tenure             1695 non-null int64
total_charges      1695 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 53.1+ KB


## Create my wrangle_telco() function and a wrangle.py file to reflect the prep I want

- Again, all the hard work you did above becomes the guts of your wrangle function, so acquiring and prepping this data the same way is a simple, repeatable process.


- Remember, this is the basic process you will go through in projects: acquire and prep your data, and create modules containing the functions you build along the way.

In [22]:
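def wrangle_telco():
    # Acquire the raw data
    df = wrangle.get_data_from_sql()
    # Treat a tenure of 0 as the customer's first month
    df['tenure'] = df.tenure.replace(0, 1)
    # Fill blank total_charges with that row's monthly_charges, then convert to float
    df['total_charges'] = df.total_charges.mask(df.total_charges == ' ', df.monthly_charges).astype(float)
    return df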
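
One design option worth considering is to keep the prep steps in a function that takes a dataframe, so they can be tested without a database connection. The sketch below assumes a hypothetical helper named `prep_telco` and made-up rows; it is not part of my wrangle.py file.

```python
import pandas as pd


def prep_telco(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the prep steps to an already-acquired frame (hypothetical helper)."""
    df = raw.copy()  # leave the caller's frame untouched
    # Flag blank-like total_charges before changing anything.
    blank = df["total_charges"].astype(str).str.strip() == ""
    # Treat a tenure of 0 as the customer's first month.
    df["tenure"] = df["tenure"].replace(0, 1)
    # Fill blanks with that row's monthly_charges, then convert to float.
    df["total_charges"] = (
        df["total_charges"].mask(blank, df["monthly_charges"]).astype(float)
    )
    return df


# Made-up rows standing in for the SQL result.
raw = pd.DataFrame(
    {
        "customer_id": ["A-001", "A-002"],
        "monthly_charges": [20.0, 50.0],
        "tenure": [0, 10],
        "total_charges": [" ", "500.0"],
    }
)
prepped = prep_telco(raw)
print(prepped)
```

With this split, a function like `wrangle_telco()` would only need to acquire the data and pass it along.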

In [23]:
# Validate that the function I just defined in the notebook runs

wrangle_telco()

Unnamed: 0,customer_id,monthly_charges,tenure,total_charges
0,0013-SMEOE,109.70,71,7904.25
1,0014-BMAQU,84.65,63,5377.80
2,0016-QLJIS,90.45,65,5957.90
3,0017-DINOC,45.20,54,2460.55
4,0017-IUDMW,116.80,72,8456.75
...,...,...,...,...
1690,9964-WBQDJ,24.40,71,1725.40
1691,9972-EWRJS,19.25,67,1372.90
1692,9975-GPKZU,19.75,46,856.50
1693,9993-LHIEB,67.85,67,4627.65


In [24]:
# Disco!

df = wrangle_telco()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1695 non-null object
monthly_charges    1695 non-null float64
tenure             1695 non-null int64
total_charges      1695 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 53.1+ KB


## End with a python file wrangle.py that contains the function, wrangle_telco(), that will acquire the data and return a dataframe cleaned with no missing values.

### Test Reading in df Using wrangle.wrangle_telco()

- After I build my function and run it, I will add it to my wrangle.py file and make sure I can actually call it from the module, not just run it from the definition in my notebook.

In [25]:
# Test importing and calling my function to get my prepped df

df = wrangle.wrangle_telco()

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
customer_id        1695 non-null object
monthly_charges    1695 non-null float64
tenure             1695 non-null int64
total_charges      1695 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 53.1+ KB


In [27]:
df.tenure.value_counts(dropna=False).sort_index()

1      12
2       1
3       3
4       4
5       1
     ... 
68     65
69     66
70     88
71    137
72    343
Name: tenure, Length: 72, dtype: int64

In [28]:
df.total_charges.value_counts(dropna=False).sort_index()

19.85      1
20.00      1
20.25      1
20.35      1
20.45      1
          ..
8547.15    1
8564.75    1
8594.40    1
8670.10    1
8672.45    1
Name: total_charges, Length: 1687, dtype: int64
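
The checks I have been doing by eye can also be written as assertions. This is a sketch with a hypothetical helper, `check_prepped`, run on a made-up frame; it fails loudly if any expectation is broken.

```python
import pandas as pd


def check_prepped(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame breaks an expectation (hypothetical helper)."""
    assert df["customer_id"].is_unique, "duplicate customer_id values"
    assert not df.isna().any().any(), "missing values remain"
    assert pd.api.types.is_float_dtype(df["total_charges"]), "total_charges is not float"
    assert (df["tenure"] >= 1).all(), "tenure below 1 remains"


# Made-up rows shaped like the prepped df.
toy = pd.DataFrame(
    {
        "customer_id": ["A-001", "A-002"],
        "monthly_charges": [20.0, 50.0],
        "tenure": [1, 10],
        "total_charges": [20.0, 500.0],
    }
)
check_prepped(toy)  # passes silently when every expectation holds
```

Calling it on the result of `wrangle.wrangle_telco()` would be one way to rerun the same checks each time the data is pulled.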

### Looks Good! I can add this to my wrangle.py file