# Strata Scratch

## Data Cleaning with Pandas Solutions

#### Topics
- Missing Values
    - Identify Missing Values
    - Drop Values
    - Impute Values 
        - Zero 
        - Mean
- Categorical Data 
    - Convert Text to Numbers 
    - Encode Labels as Boolean Variables 

Import pandas and psycopg2

In [None]:
import pandas as pd
import psycopg2 as ps

In [None]:
host_name = 'db-strata.stratascratch.com'
dbname = 'db_strata'
user_name = '' #enter username and password from profile tab in Strata Scratch
pwd = ''
port = '5432'

try:
    conn = ps.connect(host=host_name,database=dbname,user=user_name,password=pwd,port=port)
except ps.OperationalError as e:
    raise e
else:
    print('Connected!')

Pull data from the combine table or read the combine.csv file as a DataFrame and investigate the contents

In [None]:
#Make the database call
cur = conn.cursor()
cur.execute(""" 
            SELECT *  FROM datasets.combine; 
            """)
data = cur.fetchall()
colnames = [desc[0] for desc in cur.description] #grab the column names
conn.commit()

#create the dataframe from the fetched rows and the column names
data = pd.DataFrame(data, columns=colnames)

#close the cursor and the connection
cur.close()
conn.close()
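The step from fetched rows to a DataFrame can be tried without a database. The sketch below assumes `rows` looks like what `cur.fetchall()` returns (a list of tuples) and that `colnames` holds the column names; both are made-up stand-ins, not the real combine data.

```python
import pandas as pd

# Hypothetical stand-ins for cur.fetchall() and the names taken from cur.description.
rows = [('Player A', 'QB', 1), ('Player B', 'RB', 12)]
colnames = ['name', 'position', 'pick']

# Passing columns= labels the frame in one step.
frame = pd.DataFrame(rows, columns=colnames)
print(frame)
```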

In [None]:
data = pd.read_csv('combine.csv')

In [None]:
print(data.head())
print(data.info())
print(data.describe())
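Before filling or dropping anything, it helps to count the missing values in each column. A minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for the combine table):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combine data; values are made up.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'college': ['Alabama', None, 'Ohio State', None],
    'pick': [1.0, np.nan, 17.0, 45.0],
})

# isna() flags each missing cell; summing the flags counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```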

Fill missing values for college with 'No College'

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

In [None]:
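# fill only the college column; calling fillna on the whole DataFrame would also overwrite missing picks
data['college'] = data['college'].fillna('No College')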
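The topic list also mentions imputing with zero and with the mean. A minimal sketch of both, filling one column at a time so that each column gets its own rule; the `bench` and `weight` values here are made up for illustration and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
toy = pd.DataFrame({
    'college': ['Alabama', None, 'Ohio State'],
    'bench': [20.0, np.nan, 25.0],
    'weight': [200.0, 220.0, np.nan],
})

filled = toy.copy()
# Text column: use a placeholder label.
filled['college'] = filled['college'].fillna('No College')
# Zero imputation.
filled['bench'] = filled['bench'].fillna(0)
# Mean imputation: mean() skips missing values by default.
filled['weight'] = filled['weight'].fillna(filled['weight'].mean())
print(filled)
```

Zero imputation pulls the column average down, while mean imputation keeps it unchanged, so the choice depends on what a missing value means in the data.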

Remove the players that have null values for the pick

In [None]:
data_dropped = data.dropna(how='any', subset=['pick'])

print(data_dropped)
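The `subset` argument limits which columns `dropna` inspects. A small sketch with made-up rows, comparing it against a plain `dropna()` that looks at every column:

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are illustrative only.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'college': ['X', 'Y', 'Z', None],
    'pick': [1.0, np.nan, np.nan, 30.0],
})

# Only rows with a missing pick are removed.
dropped_pick = toy.dropna(how='any', subset=['pick'])

# Rows with a missing value in any column are removed.
dropped_any = toy.dropna()

print(dropped_pick)
print(dropped_any)
```

Neither call changes `toy` itself; each returns a new DataFrame.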

Investigate the unique values in the position column

In [None]:
data.keys()
data['position'].unique()
data.position.unique()
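Alongside `unique()`, `value_counts()` shows how often each value occurs, which makes rare or misspelled labels easier to spot. A minimal sketch on made-up positions:

```python
import pandas as pd

# Hypothetical position labels, including one missing value.
toy = pd.DataFrame({'position': ['QB', 'RB', 'RB', 'WR', None]})

# unique() returns the distinct values, including the missing one.
unique_positions = toy['position'].unique()

# value_counts() counts each value; dropna=False keeps the missing value in the tally.
counts = toy['position'].value_counts(dropna=False)

print(unique_positions)
print(counts)
```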

Replace RB and QB with Running Back and Quarterback

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html

In [None]:
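# assign the result back; inplace=True on a selected column may not update the DataFrame in newer pandas
data['position'] = data['position'].replace(to_replace=['RB','QB'], value=['Running Back','Quarterback'])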

data.head()
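`replace` also accepts a dictionary, which keeps each old value next to its new one. A minimal sketch on made-up positions:

```python
import pandas as pd

# Hypothetical position labels.
toy = pd.DataFrame({'position': ['RB', 'QB', 'WR', 'RB']})

# Map each abbreviation to its full name; values not in the dict are left alone.
full_names = {'RB': 'Running Back', 'QB': 'Quarterback'}
toy['position'] = toy['position'].replace(full_names)
print(toy)
```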

Create dummy values for position

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

In [None]:
dummy_data = pd.get_dummies(data.position, prefix='Pos')

dummy_data.head()
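To see what `get_dummies` produces, here is a small sketch on made-up positions. Each distinct value becomes its own indicator column, named with the prefix, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical position labels.
toy = pd.DataFrame({'position': ['QB', 'RB', 'WR', 'RB']})

dummies = pd.get_dummies(toy['position'], prefix='Pos')
print(dummies)

# One indicator per row is set, so each row sums to 1.
row_totals = dummies.sum(axis=1)
```

The indicator columns are boolean or integer depending on the pandas version.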

Merge the dummy data with the original data set

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

In [None]:
# merge returns a new DataFrame, so keep the result in a variable
data_merged = data.merge(dummy_data, how='inner', left_index=True, right_index=True)
data_merged.head()
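Because the dummy columns share the original index, the merge lines rows up by index rather than by a key column. A minimal sketch on made-up data, with `join` shown as a shorter way to do the same index alignment:

```python
import pandas as pd

# Hypothetical players and positions.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'position': ['QB', 'RB', 'QB'],
})
dummies = pd.get_dummies(toy['position'], prefix='Pos')

# Index-on-index merge; the result is a new DataFrame.
merged = toy.merge(dummies, how='inner', left_index=True, right_index=True)

# join aligns on the index by default.
joined = toy.join(dummies)

print(merged)
```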

Convert weight from lbs to kg

In [None]:
data['weight_kg'] = data.weight / 2.20462  # 1 kg is about 2.20462 lb
data.head()

Capitalize name

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

In [None]:
# Series.apply calls the function once per element, so str.upper is applied to each name in turn.
# Calling str.upper on the whole column directly would fail because a Series is not a string.

data.name.apply(str.upper).head()


# For DataFrame.apply, axis=1 passes each row to the function
# and axis=0 (the default) passes each column.
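An alternative to `apply(str.upper)` is the vectorized `.str` accessor. The sketch below uses made-up names and includes a missing value: `.str.upper()` leaves the missing entry missing, whereas passing it to `str.upper` through `apply` would typically raise a `TypeError`.

```python
import pandas as pd

# Hypothetical names, with one missing entry.
names = pd.Series(['john smith', None, 'jane doe'])

# The .str accessor upper-cases each string and skips missing values.
upper = names.str.upper()
print(upper)
```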

Reverse order of first and last name

... and introducing lambda functions

In [None]:
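# build the value for every row; calling .head() before assigning would leave all but the first five rows as NaN
data['lastfirstname'] = data.apply(lambda x: '{0},{1}'.format(x['lastname'], x['firstname']), axis=1)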
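A lambda with `axis=1` receives one row at a time. For a simple concatenation like this, adding the columns together gives the same result without a row-by-row call. A minimal sketch on made-up names, assuming `firstname` and `lastname` columns as in the cell above:

```python
import pandas as pd

# Hypothetical names.
toy = pd.DataFrame({
    'firstname': ['John', 'Jane'],
    'lastname': ['Smith', 'Doe'],
})

# Row-wise: the lambda is called once per row.
via_apply = toy.apply(
    lambda row: '{0},{1}'.format(row['lastname'], row['firstname']), axis=1
)

# Column-wise: string columns can be concatenated with +.
via_concat = toy['lastname'] + ',' + toy['firstname']

print(via_apply)
print(via_concat)
```

The lambda form is more flexible when the logic needs several columns or conditions; the column form is usually faster on large tables.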