In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import wrangle

In [2]:
df = wrangle.get_data_from_sql()

### Acquire `customer_id`, `monthly_charges`, `tenure`, and `total_charges` from the `telco_churn` database for all customers with a two-year contract.

### Read in df from SQL

>Below is what I start adding to my wrangle.py file.


- I need the imports to run my functions in my .py file.


- First, I go to 'Pancakes' to write a SQL query that pulls what I need.


- I create a function that builds the database URL I need in my next function.

`get_db_url(db_name)`


- I then create a function that passes the query I wrote in MySQL (Pancakes) to the pandas function `pd.read_sql(query, url)` and returns a df.

`get_data_from_sql()`

>Here's what the above looks like in my wrangle.py file so far. By calling `get_data_from_sql()`, I have my basic df ready to prep.

`import pandas as pd`
`import numpy as np`

`from env import host, user, password`

`def get_db_url(db_name):
    return f"mysql+pymysql://{user}:{password}@{host}/{db_name}"`


`def get_data_from_sql():
    query = """
    SELECT customer_id, monthly_charges, tenure, total_charges
    FROM customers
    WHERE contract_type_id = 3;
    """
    df = pd.read_sql(query, get_db_url('telco_churn'))
    return df`
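To try the `pd.read_sql(query, connection)` pattern without database credentials, here is a minimal sketch that swaps in an in-memory SQLite table for the real `telco_churn` database. The table contents, the `A-00x` ids, and the name `demo_df` are made up for illustration; only the query text matches the one above.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for telco_churn: an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id TEXT, monthly_charges REAL, tenure INTEGER, "
    "total_charges TEXT, contract_type_id INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        ("A-001", 109.70, 71, "7904.25", 3),  # made-up two-year customer
        ("A-002", 56.05, 0, " ", 3),          # made-up new customer, blank total
        ("A-003", 29.85, 1, "29.85", 1),      # filtered out by the WHERE clause
    ],
)

# Same query as in get_data_from_sql()
query = """
SELECT customer_id, monthly_charges, tenure, total_charges
FROM customers
WHERE contract_type_id = 3;
"""

# pd.read_sql runs the query and returns the result as a DataFrame
demo_df = pd.read_sql(query, conn)
conn.close()
print(demo_df)
```

Only the two rows with `contract_type_id = 3` come back, and `total_charges` arrives as text because the column stores strings.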

<font color=green>Now that I have the module containing my functions above, it is easy to bring in the dataframe so I can explore for cleaning and prepping.</font>

In [6]:
df = wrangle.get_data_from_sql()

### Shut the front door!!! It really worked.....

### Now let's walk through the steps above using the new dataframe. There are some missing values so handle them as you please.

<font color=blue>Let's take a look at the dataframe</font>

In [7]:
df.head()

| | customer_id | monthly_charges | tenure | total_charges |
|---|---|---|---|---|
| 0 | 0013-SMEOE | 109.7 | 71 | 7904.25 |
| 1 | 0014-BMAQU | 84.65 | 63 | 5377.8 |
| 2 | 0016-QLJIS | 90.45 | 65 | 5957.9 |
| 3 | 0017-DINOC | 45.2 | 54 | 2460.55 |
| 4 | 0017-IUDMW | 116.8 | 72 | 8456.75 |


In [8]:
print(f'My df has {df.shape[0]} rows and {df.shape[1]} columns.')

My df has 1695 rows and 4 columns.


In [10]:
df.customer_id.nunique() # nunique checks how many values are unique in this column

1695

### customer_id is the unique identifier for customers

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1695 entries, 0 to 1694
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   customer_id      1695 non-null   object 
 1   monthly_charges  1695 non-null   float64
 2   tenure           1695 non-null   int64  
 3   total_charges    1695 non-null   object 
dtypes: float64(1), int64(1), object(2)
memory usage: 53.1+ KB


### OH MY GOODNESS.... <font color=red>total_charges</font> is an object. Why?

In [12]:
df.total_charges.value_counts(dropna=False)

           10
844.45      2
1110.05     2
3533.6      2
7334.05     2
           ..
487.95      1
1524.85     1
8289.2      1
2754        1
5224.95     1
Name: total_charges, Length: 1678, dtype: int64

### <font color=red>total_charges</font> has 10 values that are either spaces or empty strings. We need to figure out which, so we can decide how to handle the missing values.
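One way to tell spaces from empty strings is to flag every value that will not convert to a number and then print it with `repr()`, which makes whitespace visible. This is only a sketch on a toy Series; the values and the names `charges`, `not_numeric`, and `suspects` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the total_charges column (made-up values)
charges = pd.Series(["7904.25", " ", "5377.8", "", "2460.55"], name="total_charges")

# Anything that cannot be parsed as a number becomes NaN, which flags the problem rows
not_numeric = pd.to_numeric(charges, errors="coerce").isna()

# repr() shows the quotes: ' ' is a single space, '' is an empty string
suspects = charges[not_numeric].map(repr)
print(suspects.value_counts())
```

On the real df, the same two steps applied to `df.total_charges` would show which kind of blank the 10 values are.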

In [15]:
df[df.total_charges == ' '] 
# here we have identified the missing values and the rows they are on


| | customer_id | monthly_charges | tenure | total_charges |
|---|---|---|---|---|
| 234 | 1371-DWPAZ | 56.05 | 0 | |
| 416 | 2520-SGTTA | 20.0 | 0 | |
| 453 | 2775-SEFEE | 61.9 | 0 | |
| 505 | 3115-CZMZD | 20.25 | 0 | |
| 524 | 3213-VVOLG | 25.35 | 0 | |
| 678 | 4075-WKNIU | 73.35 | 0 | |
| 716 | 4367-NUYAO | 25.75 | 0 | |
| 726 | 4472-LVYGI | 52.55 | 0 | |
| 941 | 5709-LVOEQ | 80.85 | 0 | |
| 1293 | 7644-OMVMY | 19.85 | 0 | |
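All 10 rows have a `tenure` of 0, so these look like customers who have not been billed yet. Two possible ways to handle them are sketched below on a toy frame: fill the blank with 0, or drop the rows. Neither is presented as the required choice; the data and the names `toy`, `filled`, and `cleaned` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the df above (made-up ids and values)
toy = pd.DataFrame({
    "customer_id": ["A-001", "A-002", "A-003"],
    "monthly_charges": [109.70, 56.05, 20.00],
    "tenure": [71, 0, 0],
    "total_charges": ["7904.25", " ", " "],
})

# Option 1: treat a blank as "nothing billed yet" and store 0.0
filled = toy.copy()
filled["total_charges"] = filled["total_charges"].replace(" ", "0").astype(float)

# Option 2: drop the rows whose total_charges is blank, then convert to float
cleaned = toy[toy["total_charges"].str.strip() != ""].copy()
cleaned["total_charges"] = cleaned["total_charges"].astype(float)

print(filled.dtypes)
print(len(toy), "rows before,", len(cleaned), "rows after dropping")
```

Filling with 0 keeps every customer in the data, while dropping removes customers who have no billing history yet; which is better depends on what the df is used for next.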
