# Run-2016-Wrangle

Here I will be cleaning the dataset 'Career_Stats_Rushing.csv'. This dataset can be found [here](https://www.kaggle.com/kendallgillies/nflstatistics).

In [1]:
# import all packages and set plots to be embedded inline. Also, set all columns and rows to be displayed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


%matplotlib inline

### Various Functions

## Gather

In [2]:
# Read the rushing stats CSV file into a DataFrame and create a copy for wrangling.
df_original = pd.read_csv('Career_Stats_Rushing.csv')
df = df_original.copy()
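
As a side note, the `'--'` markers noted in the first assessment could also be handled at load time. The sketch below is not what this notebook does; it uses a small hypothetical CSV (the rows and the name `sample_csv` are invented for illustration) to show the `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same style as the rushing file.
sample_csv = io.StringIO(
    "Name,Year,Rushing Attempts,Rushing Yards\n"
    '"Evans, Fred",1948,10,15\n'
    '"Evans, Fred",1947,--,--\n'
    '"Doe, John",2012,1,17\n'
)

# na_values tells the parser to treat '--' as missing while reading.
sample = pd.read_csv(sample_csv, na_values=['--'])

print(sample.dtypes)
print(sample.isna().sum())
```

Loading this way would give numeric columns directly, but it would also hide the raw markers from the assessment step, which is one reason to keep the two-step approach used here.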

## Assess 1 and 2

(1) Multiple data points contain the symbol '--' to indicate missing data.<br> 
(2) The columns 'Rushing Attempts', 'Rushing Yards', 'Yards Per Carry', 'Rushing Yards Per Game', 'Rushing TDs', 'Longest Rushing Run', 'Rushing First Downs', 'Percentage of Rushing First Downs', 'Rushing More Than 20 Yards', 'Rushing More Than 40 Yards', and 'Fumbles' are of the data type 'object'.<br>

### Assessments 1 and 2

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17507 entries, 0 to 17506
Data columns (total 18 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Player Id                          17507 non-null  object 
 1   Name                               17507 non-null  object 
 2   Position                           2319 non-null   object 
 3   Year                               17507 non-null  int64  
 4   Team                               17507 non-null  object 
 5   Games Played                       17507 non-null  int64  
 6   Rushing Attempts                   17507 non-null  object 
 7   Rushing Attempts Per Game          17507 non-null  float64
 8   Rushing Yards                      17507 non-null  object 
 9   Yards Per Carry                    17507 non-null  object 
 10  Rushing Yards Per Game             17507 non-null  object 
 11  Rushing TDs                        17507 non-null  obj

## Clean - Assessments 1 and 2

I will clean the first two issues I found first because they make further exploration more difficult.

#### Define - Assessment 1

I will use the pandas replace function to replace all of the -- symbols with NaN.

#### Code - Assessment 1

In [4]:
# Replace all instances of '--' with np.nan throughout the dataframe.
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0.)
df.replace('--', np.nan, inplace = True)

In [5]:
df.columns.tolist()

['Player Id',
 'Name',
 'Position',
 'Year',
 'Team',
 'Games Played',
 'Rushing Attempts',
 'Rushing Attempts Per Game',
 'Rushing Yards',
 'Yards Per Carry',
 'Rushing Yards Per Game',
 'Rushing TDs',
 'Longest Rushing Run',
 'Rushing First Downs',
 'Percentage of Rushing First Downs',
 'Rushing More Than 20 Yards',
 'Rushing More Than 40 Yards',
 'Fumbles']

#### Test - Assessment 1

In [6]:
# Count how many columns still contain '--' anywhere (0 means none remain).
df.isin(['--']).any().sum()

0
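
Note that `.any().sum()` counts columns rather than individual cells. A minimal sketch on a hypothetical toy frame (the name `toy` and its values are invented) shows the difference between the two counts:

```python
import pandas as pd

# Hypothetical data: three '--' cells spread over two of the three columns.
toy = pd.DataFrame({
    'a': ['--', '--', '3'],
    'b': ['1', '--', '2'],
    'c': ['4', '5', '6'],
})

marker_mask = toy.isin(['--'])

# Number of columns that contain at least one '--'.
columns_with_marker = marker_mask.any().sum()

# Number of individual cells equal to '--'.
cells_with_marker = marker_mask.sum().sum()

print(columns_with_marker, cells_with_marker)
```

Either count being 0 confirms the marker is gone, so the test above is still valid as a pass/fail check.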

#### Define - Assessment 2

I will convert all of the columns that are of the wrong data type to a numeric data type.
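
Because these columns contain missing values, `pd.to_numeric` returns `float64` rather than a plain integer type. If whole-number counts are preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses a hypothetical Series (`counts`) and is not part of this notebook's cleaning:

```python
import pandas as pd

# Hypothetical raw counts with one missing-value marker.
counts = pd.Series(['10', '--', '3'])

# Unparseable strings become NaN, which forces a float dtype.
as_float = pd.to_numeric(counts, errors='coerce')

# The nullable integer dtype keeps whole numbers and stores the gap as <NA>.
as_nullable_int = as_float.astype('Int64')

print(as_float.dtype)
print(as_nullable_int.dtype)
```

This only suits columns whose values are whole numbers; ratio columns such as 'Yards Per Carry' would stay as floats.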

#### Code - Assessment 2

In [7]:
# Make a list of the columns to convert, then apply pd.to_numeric to each of them.
wrong_type = ['Rushing Attempts', 'Rushing Yards', 'Yards Per Carry', 'Rushing Yards Per Game', 
              'Rushing TDs', 'Longest Rushing Run', 'Rushing First Downs', 
              'Percentage of Rushing First Downs', 'Rushing More Than 20 Yards', 'Rushing More Than 40 Yards', 'Fumbles']

df[wrong_type] = df[wrong_type].apply(pd.to_numeric, errors='coerce')

#### Test - Assessment 2

In [8]:
df[wrong_type].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17507 entries, 0 to 17506
Data columns (total 11 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Rushing Attempts                   11457 non-null  float64
 1   Rushing Yards                      11070 non-null  float64
 2   Yards Per Carry                    11376 non-null  float64
 3   Rushing Yards Per Game             11445 non-null  float64
 4   Rushing TDs                        11457 non-null  float64
 5   Longest Rushing Run                6725 non-null   float64
 6   Rushing First Downs                4875 non-null   float64
 7   Percentage of Rushing First Downs  4851 non-null   float64
 8   Rushing More Than 20 Yards         4875 non-null   float64
 9   Rushing More Than 40 Yards         4875 non-null   float64
 10  Fumbles                            4875 non-null   float64
dtypes: float64(11)
memory usage: 1.5 MB
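
The non-null counts above differ between columns (for example 11457 for 'Rushing Attempts' but 11070 for 'Rushing Yards'). The cause is not established here. One possibility, which is only an assumption, is that `errors='coerce'` turned some non-missing strings into NaN. The sketch below shows how such values could be listed; the Series `raw` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical raw values; '1,234' and '45T' are invented for illustration.
raw = pd.Series(['10', '--', '1,234', '45T', '7'])

# Cells that were explicitly marked as missing.
missing_marker = raw == '--'

# Coerce to numbers; anything unparseable becomes NaN.
converted = pd.to_numeric(raw, errors='coerce')

# Values that became NaN even though they were not the missing marker.
lost = raw[converted.isna() & ~missing_marker]

print(lost.tolist())
```

Running the same check against a column of `df_original` would show whether any real values were lost during conversion.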


## Assess 3 onward

(3) Column names have spaces in them.<br>
(4) One row has a value greater than 100 in the column 'Percentage of Rushing First Downs'.

### Assessment 3

In [9]:
df.head()

Unnamed: 0,Player Id,Name,Position,Year,Team,Games Played,Rushing Attempts,Rushing Attempts Per Game,Rushing Yards,Yards Per Carry,Rushing Yards Per Game,Rushing TDs,Longest Rushing Run,Rushing First Downs,Percentage of Rushing First Downs,Rushing More Than 20 Yards,Rushing More Than 40 Yards,Fumbles
0,fredevans/2513736,"Evans, Fred",,1948,Chicago Bears,3,10.0,3.3,15.0,1.5,5.0,0.0,,,,,,
1,fredevans/2513736,"Evans, Fred",,1948,Chicago Rockets,0,,0.0,,,,,,,,,,
2,fredevans/2513736,"Evans, Fred",,1947,Chicago Rockets,0,,0.0,,,,,,,,,,
3,fredevans/2513736,"Evans, Fred",,1947,Buffalo Bills,0,,0.0,,,,,,,,,,
4,fredevans/2513736,"Evans, Fred",,1946,Cleveland Browns,0,,0.0,,,,,,,,,,


### Assessment 4

In [10]:
df[df['Percentage of Rushing First Downs'] > 100]

Unnamed: 0,Player Id,Name,Position,Year,Team,Games Played,Rushing Attempts,Rushing Attempts Per Game,Rushing Yards,Yards Per Carry,Rushing Yards Per Game,Rushing TDs,Longest Rushing Run,Rushing First Downs,Percentage of Rushing First Downs,Rushing More Than 20 Yards,Rushing More Than 40 Yards,Fumbles
363,kealohapilares/2495326,"Pilares, Kealoha",,2012,Carolina Panthers,8,1.0,0.1,17.0,17.0,2.1,0.0,12.0,2.0,200.0,0.0,0.0,0.0
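
To see where the 200.0 might come from, the percentage can be recomputed from the row's own counts. This assumes the column is defined as first downs divided by attempts, times 100; the dataset does not document the formula, so treat this as a guess.

```python
import pandas as pd

# Values copied from the flagged row (index 363) shown above.
row = pd.Series({
    'Rushing Attempts': 1.0,
    'Rushing First Downs': 2.0,
    'Percentage of Rushing First Downs': 200.0,
})

# Assumed definition: first downs / attempts * 100.
recomputed = row['Rushing First Downs'] / row['Rushing Attempts'] * 100

print(recomputed)
```

Under that assumption the stored percentage is consistent with the counts, which would suggest the problem lies in the counts themselves (two first downs on one attempt) rather than in the percentage calculation.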


## Clean Assessments 3 and 4

### Define Assessment 3

I will use the pandas rename function, together with Python's string replace method, to replace all spaces in the column names with underscores.
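
A compact way to express the same idea is the vectorized string accessor on the column index. The sketch below uses a hypothetical empty frame (`toy`) with three of the real column names and is an alternative to the approach used in this notebook:

```python
import pandas as pd

# Hypothetical empty frame that only carries column names.
toy = pd.DataFrame(columns=['Player Id', 'Games Played', 'Rushing More Than 20 Yards'])

# Vectorized replace on the column Index; regex=False treats ' ' literally.
toy.columns = toy.columns.str.replace(' ', '_', regex=False)

print(toy.columns.tolist())
```

Underscored names also allow columns to be referenced directly in `DataFrame.query` expressions without backtick quoting.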

### Code Assessment 3

In [11]:
# Rename every column in one call by passing a function that swaps spaces for underscores.
# This avoids both a long dictionary and a separate rename call per column.
df.rename(columns = lambda name: name.replace(' ', '_'), inplace = True)

### Test Assessment 3

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17507 entries, 0 to 17506
Data columns (total 18 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Player_Id                          17507 non-null  object 
 1   Name                               17507 non-null  object 
 2   Position                           2319 non-null   object 
 3   Year                               17507 non-null  int64  
 4   Team                               17507 non-null  object 
 5   Games_Played                       17507 non-null  int64  
 6   Rushing_Attempts                   11457 non-null  float64
 7   Rushing_Attempts_Per_Game          17507 non-null  float64
 8   Rushing_Yards                      11070 non-null  float64
 9   Yards_Per_Carry                    11376 non-null  float64
 10  Rushing_Yards_Per_Game             11445 non-null  float64
 11  Rushing_TDs                        11457 non-null  flo

### Define Assessment 4

The stats page for this player can be found [here](https://www.espn.com/nfl/player/stats/_/id/14150/kealoha-pilares) under the rushing section. That page appears to contain the same error. Because I cannot verify any of this player's rushing data, I will drop his row from the dataset.

### Code Assessment 4

In [13]:
# Use pandas drop function to remove row from df.
false_row = df.query('Percentage_of_Rushing_First_Downs > 100').index
df.drop(false_row, inplace = True)
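
One detail worth keeping in mind when removing out-of-range values is how NaN behaves in comparisons. The sketch below, on a hypothetical three-row frame (`toy`), contrasts dropping rows where the value is above 100 with keeping rows where the value is at most 100:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one valid value, one impossible value, one missing value.
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Percentage_of_Rushing_First_Downs': [25.0, 200.0, np.nan],
})

# NaN > 100 evaluates to False, so the missing row is not flagged.
too_high = toy['Percentage_of_Rushing_First_Downs'] > 100

# Dropping flagged rows keeps the row with the missing value.
kept = toy[~too_high]

# Keeping only rows that pass '<= 100' also discards the missing row.
kept_strict = toy[toy['Percentage_of_Rushing_First_Downs'] <= 100]

print(len(kept), len(kept_strict))
```

The query-and-drop approach in the cell above behaves like the first variant, so rows with a missing percentage stay in the dataset.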

### Test Assessment 4

In [14]:
df.query('Percentage_of_Rushing_First_Downs > 100')

Unnamed: 0,Player_Id,Name,Position,Year,Team,Games_Played,Rushing_Attempts,Rushing_Attempts_Per_Game,Rushing_Yards,Yards_Per_Carry,Rushing_Yards_Per_Game,Rushing_TDs,Longest_Rushing_Run,Rushing_First_Downs,Percentage_of_Rushing_First_Downs,Rushing_More_Than_20_Yards,Rushing_More_Than_40_Yards,Fumbles


In [15]:
# Create master csv file
df.to_csv('Career_Stats_Rushing_master.csv', index = False)
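
The `index = False` argument matters when the file is read back later. A minimal round-trip sketch on a hypothetical two-row frame (`toy`), written to in-memory strings instead of files, shows what happens with and without it:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data.
toy = pd.DataFrame({'Name': ['A', 'B'], 'Rushing_Yards': [15.0, 17.0]})

# to_csv() without a path returns the CSV text; the index is written by default.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# With index=False only the data columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Without `index = False`, the saved index comes back as an extra unnamed column on the next load.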