## Wrangling parks data

### Goals of the Task


The parks and recreation data consists of two data sets. 

- The smaller data set (parks) contains the address, longitude and latitude of each Seattle park (each row is a park). 
- The larger data set (features) records which facilities each park has, such as picnic areas, basketball courts and football pitches (each row is one facility in one park). 

The aim of this task is to combine and reshape the data into a wide rather than long frame where each row is a park, and there is a Boolean column for each feature type. 

#### Step 1: use pandas to read the parks and features data files into data frames
- import pandas as pd 
- use pandas read_csv to create a parks data frame and a features (facilities) data frame 
- ensure you are pointing at the correct file path for the data source (you may need to adjust the path relative to your notebook's working directory) 
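
A wrong file path is the most common failure at this step. The sketch below shows one way to fail early with a clearer message. The helper name `read_csv_checked` is hypothetical and not part of pandas or of the original notebook.

```python
from pathlib import Path

import pandas as pd


def read_csv_checked(path):
    """Read a CSV with pandas, failing early with the resolved path if it is missing.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.is_file():
        # Report both the resolved path and the current working directory,
        # since notebooks often run from an unexpected folder.
        raise FileNotFoundError(
            f"No such file: {path.resolve()} (working directory is {Path.cwd()})"
        )
    return pd.read_csv(path)
```

It would be called as `parks = read_csv_checked('Seattle_Parks_And_Recreation_Park_Addresses.csv')`, assuming the file sits next to the notebook.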


In [1]:
import pandas as pd

In [4]:
parks = pd.read_csv('Seattle_Parks_And_Recreation_Park_Addresses.csv')

In [6]:
features = pd.read_csv('Seattle_Parks_and_Recreation_Parks_Features.csv')

In [13]:
parks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412 entries, 0 to 411
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   PMAID       412 non-null    int64  
 1   LocID       412 non-null    int64  
 2   Name        412 non-null    object 
 3   Address     412 non-null    object 
 4   ZIP Code    412 non-null    int64  
 5   X Coord     412 non-null    float64
 6   Y Coord     412 non-null    float64
 7   Location 1  412 non-null    object 
dtypes: float64(2), int64(3), object(3)
memory usage: 25.9+ KB


In [14]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1558 entries, 0 to 1557
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PMAID         1558 non-null   int64  
 1   Name          1558 non-null   object 
 2   Alt_Name      222 non-null    object 
 3   xPos          1521 non-null   float64
 4   yPos          1522 non-null   float64
 5   Feature_ID    1558 non-null   int64  
 6   hours         1555 non-null   object 
 7   Feature_Desc  1558 non-null   object 
 8   CHILD_DESC    638 non-null    object 
 9   FIELD_TYPE    216 non-null    object 
 10  YOUTH_ONLY    1558 non-null   bool   
 11  LIGHTING      1558 non-null   bool   
 12  Location 1    1521 non-null   object 
dtypes: bool(2), float64(2), int64(2), object(7)
memory usage: 137.1+ KB


In [15]:
parks.head()

Unnamed: 0,PMAID,LocID,Name,Address,ZIP Code,X Coord,Y Coord,Location 1
0,281,2545,12th and Howe Play Park,1200 W Howe St,98119,-122.372985,47.636097,"(47.636097, -122.372985)"
1,4159,2387,12th Ave S Viewpoint,2821 12TH Ave S,98144,-122.317765,47.577953,"(47.577953, -122.317765)"
2,4467,2382,12th Ave Square Park,564 12th Ave,98122,-122.316455,47.607427,"(47.607427, -122.316455)"
3,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)"
4,296,296,3001 E Madison,3001 E Madison St,98112,-122.293173,47.625169,"(47.625169, -122.293173)"


In [17]:
features.head()

Unnamed: 0,PMAID,Name,Alt_Name,xPos,yPos,Feature_ID,hours,Feature_Desc,CHILD_DESC,FIELD_TYPE,YOUTH_ONLY,LIGHTING,Location 1
0,281,12th and Howe Play Park,,-122.372985,47.636097,22,6 a.m. - 10 p.m.,Play Area,Play Area,,False,False,"1200 W Howe St\n(-122.372985, 47.636097)"
1,4159,12th Ave S Viewpoint,,-122.317765,47.577953,34,6 a.m. - 10 p.m.,View,,,False,False,"2821 12TH Ave S\n(-122.317765, 47.577953)"
2,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,7,4 a.m. - 11:30 p.m.,Boat Launch (Hand Carry),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
3,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,6,4 a.m. - 11:30 p.m.,Boat Launch (Motorized),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
4,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,36,4 a.m. - 11:30 p.m.,Waterfront,,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"


#### Step 2: reformat the column headers in lower case 

- the two data sets capitalise their column headers inconsistently (for example PMAID, Name and Feature_Desc), so convert every header to lower case with the str.lower() method. 

    - example : df.columns = df.columns.str.lower() 
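
As a small illustration on toy data (the two frames below are made up, with headers copied from the real files), lowercasing the headers also makes it easy to see which column names the two frames share before they are joined:

```python
import pandas as pd

# Toy frames whose headers mimic the real parks and features files.
parks_toy = pd.DataFrame({"PMAID": [281], "Name": ["12th and Howe Play Park"]})
features_toy = pd.DataFrame(
    {"PMAID": [281], "Name": ["12th and Howe Play Park"], "Feature_Desc": ["Play Area"]}
)

# .str on a column Index applies the string method to every label.
parks_toy.columns = parks_toy.columns.str.lower()
features_toy.columns = features_toy.columns.str.lower()

# Columns present in both frames: one is the join key, and any others
# will receive _x / _y suffixes when the frames are merged.
shared = parks_toy.columns.intersection(features_toy.columns)
print(list(features_toy.columns))
print(list(shared))
```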

In [20]:
parks.columns=parks.columns.str.lower()

In [21]:
parks.head()

Unnamed: 0,pmaid,locid,name,address,zip code,x coord,y coord,location 1
0,281,2545,12th and Howe Play Park,1200 W Howe St,98119,-122.372985,47.636097,"(47.636097, -122.372985)"
1,4159,2387,12th Ave S Viewpoint,2821 12TH Ave S,98144,-122.317765,47.577953,"(47.577953, -122.317765)"
2,4467,2382,12th Ave Square Park,564 12th Ave,98122,-122.316455,47.607427,"(47.607427, -122.316455)"
3,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)"
4,296,296,3001 E Madison,3001 E Madison St,98112,-122.293173,47.625169,"(47.625169, -122.293173)"


In [23]:
features.columns=features.columns.str.lower()

In [25]:
features.head()

Unnamed: 0,pmaid,name,alt_name,xpos,ypos,feature_id,hours,feature_desc,child_desc,field_type,youth_only,lighting,location 1
0,281,12th and Howe Play Park,,-122.372985,47.636097,22,6 a.m. - 10 p.m.,Play Area,Play Area,,False,False,"1200 W Howe St\n(-122.372985, 47.636097)"
1,4159,12th Ave S Viewpoint,,-122.317765,47.577953,34,6 a.m. - 10 p.m.,View,,,False,False,"2821 12TH Ave S\n(-122.317765, 47.577953)"
2,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,7,4 a.m. - 11:30 p.m.,Boat Launch (Hand Carry),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
3,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,6,4 a.m. - 11:30 p.m.,Boat Launch (Motorized),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
4,4010,14th Ave NW Boat Ramp,,-122.373536,47.660775,36,4 a.m. - 11:30 p.m.,Waterfront,,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"


#### Step 3: join the data frames together 

- use the pandas merge method to combine the two data frames into a new single data frame
- use the pmaid column as the merge key

https://www.geeksforgeeks.org/merge-two-pandas-dataframes-by-matched-id-number/ 

In [41]:
park_features = pd.merge(parks, features, on = 'pmaid', how='outer')

In [35]:
park_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1546 entries, 0 to 1545
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   pmaid         1546 non-null   int64  
 1   locid         1546 non-null   int64  
 2   name_x        1546 non-null   object 
 3   address       1546 non-null   object 
 4   zip code      1546 non-null   int64  
 5   x coord       1546 non-null   float64
 6   y coord       1546 non-null   float64
 7   location 1_x  1546 non-null   object 
 8   name_y        1546 non-null   object 
 9   alt_name      222 non-null    object 
 10  xpos          1510 non-null   float64
 11  ypos          1511 non-null   float64
 12  feature_id    1546 non-null   int64  
 13  hours         1543 non-null   object 
 14  feature_desc  1546 non-null   object 
 15  child_desc    635 non-null    object 
 16  field_type    216 non-null    object 
 17  youth_only    1546 non-null   bool   
 18  lighting      1546 non-null 

In [36]:
park_features.head()

Unnamed: 0,pmaid,locid,name_x,address,zip code,x coord,y coord,location 1_x,name_y,alt_name,xpos,ypos,feature_id,hours,feature_desc,child_desc,field_type,youth_only,lighting,location 1_y
0,281,2545,12th and Howe Play Park,1200 W Howe St,98119,-122.372985,47.636097,"(47.636097, -122.372985)",12th and Howe Play Park,,-122.372985,47.636097,22,6 a.m. - 10 p.m.,Play Area,Play Area,,False,False,"1200 W Howe St\n(-122.372985, 47.636097)"
1,4159,2387,12th Ave S Viewpoint,2821 12TH Ave S,98144,-122.317765,47.577953,"(47.577953, -122.317765)",12th Ave S Viewpoint,,-122.317765,47.577953,34,6 a.m. - 10 p.m.,View,,,False,False,"2821 12TH Ave S\n(-122.317765, 47.577953)"
2,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)",14th Ave NW Boat Ramp,,-122.373536,47.660775,7,4 a.m. - 11:30 p.m.,Boat Launch (Hand Carry),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
3,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)",14th Ave NW Boat Ramp,,-122.373536,47.660775,6,4 a.m. - 11:30 p.m.,Boat Launch (Motorized),,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
4,4010,2546,14th Ave NW Boat Ramp,4400 14th Ave NW,98107,-122.373536,47.660775,"(47.660775, -122.373536)",14th Ave NW Boat Ramp,,-122.373536,47.660775,36,4 a.m. - 11:30 p.m.,Waterfront,,,False,False,"4400 14th Ave NW\n(-122.373536, 47.660775)"
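
The info() output above reports 1546 rows with no missing park columns, whereas the frame displayed in step 4 runs to index 1662 and has rows with missing values on either side. That would be consistent with the merge having been run with different `how` settings at different times, though the notebook does not say so. The sketch below uses made-up toy rows (the park name "Hypothetical Pier" is invented) to show how the row count depends on `how`, and how `indicator=True` labels where each row came from.

```python
import pandas as pd

# Toy data: park 4467 has no features, and feature rows for 412 have no park record.
parks_toy = pd.DataFrame(
    {"pmaid": [281, 4467], "name": ["12th and Howe Play Park", "12th Ave Square Park"]}
)
features_toy = pd.DataFrame(
    {
        "pmaid": [281, 412],
        "name": ["12th and Howe Play Park", "Hypothetical Pier"],  # second name is invented
        "feature_desc": ["Play Area", "Fishing"],
    }
)

# Inner join: only keys present in both frames survive.
inner = pd.merge(parks_toy, features_toy, on="pmaid", how="inner")

# Outer join: every key survives; indicator=True adds a _merge column
# saying whether a row came from the left frame, the right frame, or both.
outer = pd.merge(parks_toy, features_toy, on="pmaid", how="outer", indicator=True)

print(len(inner), len(outer))
print(outer[["pmaid", "_merge"]])
```

Filtering on `_merge` is one way to list the unmatched parks or features before deciding which join type the task needs.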


#### Step 4: drop unnecessary columns

The columns we want to keep in the resulting data frame are:

- zip code
- x coord
- y coord
- locid (location id) 
- name (park name; this appears as name_x after the merge) 
- pmaid (park id) 
- feature_id (facility id) 
- feature_desc (facility description)

Drop all remaining columns.

In [43]:
park_features.drop(['address','location 1_x','name_y','xpos','ypos','hours','child_desc','field_type',
                    'youth_only','lighting','location 1_y'], axis = 1, inplace=True)

In [44]:
park_features

Unnamed: 0,pmaid,locid,name_x,zip code,x coord,y coord,alt_name,feature_id,feature_desc
0,281,2545.0,12th and Howe Play Park,98119.0,-122.372985,47.636097,,22.0,Play Area
1,4159,2387.0,12th Ave S Viewpoint,98144.0,-122.317765,47.577953,,34.0,View
2,4467,2382.0,12th Ave Square Park,98122.0,-122.316455,47.607427,,,
3,4010,2546.0,14th Ave NW Boat Ramp,98107.0,-122.373536,47.660775,,7.0,Boat Launch (Hand Carry)
4,4010,2546.0,14th Ave NW Boat Ramp,98107.0,-122.373536,47.660775,,6.0,Boat Launch (Motorized)
...,...,...,...,...,...,...,...,...,...
1659,412,,,,,,,15.0,Fishing
1660,412,,,,,,,15.0,Fishing
1661,412,,,,,,,18.0,Historic Landmark
1662,412,,,,,,,36.0,Waterfront
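
The frame displayed above still carries alt_name, which is not on the keep list, and the park name is still called name_x. One alternative, sketched here on a single toy row, is to select the wanted columns rather than list the unwanted ones, and to rename name_x at the same time. The variable names are illustrative.

```python
import pandas as pd

# One toy merged row with a subset of the columns produced by the merge.
merged_toy = pd.DataFrame(
    {
        "pmaid": [281],
        "locid": [2545],
        "name_x": ["12th and Howe Play Park"],
        "address": ["1200 W Howe St"],
        "zip code": [98119],
        "x coord": [-122.372985],
        "y coord": [47.636097],
        "name_y": ["12th and Howe Play Park"],
        "alt_name": [None],
        "feature_id": [22],
        "feature_desc": ["Play Area"],
    }
)

keep = ["pmaid", "locid", "name_x", "zip code", "x coord", "y coord",
        "feature_id", "feature_desc"]

# Selecting by the keep list cannot leave a stray column behind,
# and rename gives the park name its intended header.
tidy = merged_toy[keep].rename(columns={"name_x": "name"})
print(list(tidy.columns))
```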


#### Step 5: examine and clean the feature column

- examine the feature_desc column using the pandas unique() method
- note that each value in this column describes just one facility in a park
- this means a park with several facilities has several rows (one row per facility)
- in some cases you will also see duplicate rows; these were distinguished only by columns you removed earlier
- for example, Alki Beach Park (PMAID 445) has 
    - 2 x boat launches (hand carry)
    - a fire pit
    - 2 x paths
    - picnic sites
    - 2 x restrooms
    - a view
    - a waterfront
- first, de-duplicate the data frame to remove duplicate feature listings
- remember to reset the index of your data frame after dropping duplicate rows
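
The steps above can be sketched on a few toy rows. The feature labels other than "View" are illustrative guesses at the real strings, and de-duplicating on the pmaid / feature_desc pair is an assumption about what counts as a duplicate listing.

```python
import pandas as pd

# Toy rows loosely modelled on Alki Beach Park: "Restrooms" is listed twice.
toy = pd.DataFrame(
    {
        "pmaid": [445, 445, 445, 445],
        "name": ["Alki Beach Park"] * 4,
        "feature_desc": ["Restrooms", "Restrooms", "Fire Pit", "View"],
    }
)

# Distinct feature descriptions, in order of first appearance.
print(toy["feature_desc"].unique())

# Keep the first row for each (park, feature) pair, then rebuild a clean 0..n-1 index.
deduped = (
    toy.drop_duplicates(subset=["pmaid", "feature_desc"])
    .reset_index(drop=True)
)
print(deduped)
```

Passing `drop=True` to reset_index discards the old index instead of adding it as a new column.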

#### Step 6: turn the feature column into multiple Boolean (1/0) facility columns

- we want a list of parks alongside a column for every possible feature, showing which features each park has
- there are 68 features described in total, and some are very similar (e.g. basketball(full) / basketball(half)), so OPTIONALLY you can pause here to merge those features using the text analysis methods you learnt in topic 8
- use the pandas pivot_table method to pivot the feature description column into multiple columns, which changes the shape of the data from long to wide

    - example: pd.pivot_table(df, index=['pmaid'], columns=['feature_desc'], values='feature_id', aggfunc='count')

- replace the NaN entries in the resulting data frame with 0 using the pandas fillna() method 
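
A minimal sketch of this step, using rows taken from the features.head() output earlier. The regular expression that strips a trailing "(...)" qualifier is only one possible way to merge similar features and is an assumption, as is the column name `feature_group`.

```python
import pandas as pd

long_toy = pd.DataFrame(
    {
        "pmaid": [281, 4010, 4010, 4010],
        "name": ["12th and Howe Play Park"] + ["14th Ave NW Boat Ramp"] * 3,
        "feature_id": [22, 7, 6, 36],
        "feature_desc": [
            "Play Area",
            "Boat Launch (Hand Carry)",
            "Boat Launch (Motorized)",
            "Waterfront",
        ],
    }
)

# Optional: collapse near-identical features by removing a trailing "(...)" qualifier,
# so both boat launch variants become "Boat Launch".
long_toy["feature_group"] = long_toy["feature_desc"].str.replace(
    r"\s*\([^)]*\)$", "", regex=True
)

# Long to wide: one row per park, one column per feature, cells hold row counts.
wide = pd.pivot_table(
    long_toy,
    index=["pmaid", "name"],
    columns="feature_group",
    values="feature_id",
    aggfunc="count",
)

# Features a park lacks are NaN after the pivot: fill with 0,
# then compare with 0 to get True/False flags.
wide = wide.fillna(0).gt(0).reset_index()
wide.columns.name = None
print(wide)
```

Rows whose feature value is missing (parks with no listed features) do not appear in the pivoted result, so they would need to be added back separately if every park must be present.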

#### Step 7: validate the data
- use EDA techniques including visualisation to validate the reshaping process 
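
A few mechanical checks can complement the visual ones. The sketch below runs on toy rows and uses pd.crosstab as a compact stand-in for the pivot from step 6. The specific checks are suggestions, not part of the original task.

```python
import pandas as pd

long_toy = pd.DataFrame(
    {
        "pmaid": [281, 4010, 4010, 4010],
        "feature_desc": [
            "Play Area",
            "Boat Launch (Hand Carry)",
            "Boat Launch (Motorized)",
            "Waterfront",
        ],
    }
)

# Wide Boolean frame: rows are parks, columns are features.
wide_toy = pd.crosstab(long_toy["pmaid"], long_toy["feature_desc"]).gt(0)

# Check 1: the wide frame has exactly one row per park.
assert len(wide_toy) == long_toy["pmaid"].nunique()

# Check 2: the number of True flags equals the number of distinct (park, feature) pairs.
pairs = long_toy.drop_duplicates(["pmaid", "feature_desc"])
assert int(wide_toy.to_numpy().sum()) == len(pairs)

# Check 3: parks per feature, a natural input for a bar chart.
feature_counts = wide_toy.sum().sort_values(ascending=False)
print(feature_counts)
```

If matplotlib is installed, `feature_counts.plot.barh()` would draw the per-feature counts as a horizontal bar chart for a quick visual check.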