**Data prep**

**2022: Week 4 The Prep School - Travel Plans**

<a href = "https://preppindata.blogspot.com/2022/01/2022-week-4-prep-school-travel-plans.html" >Data source and requirements </a>

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')


**Input the csv file**

In [2]:
data1= pd.read_csv("PD 2022 Wk 1 Input - Input2.csv")
data1.head(10)

Unnamed: 0,id,pupil first name,pupil last name,gender,Date of Birth,Parental Contact Name_1,Parental Contact Name_2,Preferred Contact Employer,Parental Contact
0,1,Ronna,Nellies,Female,12/21/2013,Purcell,Ketti,Demizz,1
1,2,Rusty,Andriulis,Male,7/21/2012,Vassili,Rivi,Brainbox,1
2,3,Roberta,Oakeshott,Female,12/4/2011,Lind,Haskell,Centidel,2
3,4,Lola,Rubinfajn,Male,6/29/2012,Elie,Tresa,Edgeblab,2
4,5,Kamila,Benedtti,Female,7/10/2012,Adela,Clevey,Trudoo,1
5,6,Avery,Colebourn,Female,8/30/2012,Dalenna,Charley,Linktype,1
6,7,Valentino,Klimko,Female,12/23/2014,Arlette,Onofredo,Thoughtblab,2
7,8,Cal,Shearwood,Male,1/18/2015,Leontine,Berne,Browseblab,2
8,9,King,Truswell,Female,9/14/2012,Evvy,Othelia,Photospace,1
9,10,Towney,Stichall,Male,6/4/2015,Wendie,Joyann,Kwimbee,2


In [3]:
data2= pd.read_csv("PD 2021 WK 1 to 4 ideas - Preferences of Travel.csv")
data2.head(10)

Unnamed: 0,Student ID,M,Tu,W,Th,F
0,1,Car,Car,Car,Car,Bycycle
1,2,Bicycle,Bicycle,Bicycle,Walk,Walk
2,3,Car,Bicycle,Carr,Walk,Car
3,4,Scooter,Scooter,Scootr,Scooter,Scoter
4,5,Bycycle,Carr,Scoter,Walkk,Scoter
5,6,Car,Car,Car,Car,Car
6,7,Walk,Walk,Wallk,Walk,WAlk
7,8,Car,Walk,Bicycle,Walk,Walk
8,9,Aeroplane,Aeroplane,Aeroplane,Aeroplane,Aeroplane
9,10,Car,Walk,Car,Walk,Car


**Join the data sets together on the student ID to give us the travel preferences per student**

In [4]:
data=pd.merge(data1,data2, left_on='id', right_on='Student ID')
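A quick way to confirm the join behaves as expected is to let pandas validate the key relationship and flag unmatched rows. The sketch below uses two small made-up frames (`pupils` and `travel` are illustrative stand-ins, not the challenge files) and assumes each student ID should appear exactly once on each side.

```python
import pandas as pd

# Hypothetical toy frames standing in for data1 and data2
pupils = pd.DataFrame({"id": [1, 2, 3],
                       "pupil first name": ["Ronna", "Rusty", "Roberta"]})
travel = pd.DataFrame({"Student ID": [1, 2, 4],
                       "M": ["Car", "Bicycle", "Scooter"]})

# validate raises MergeError if a key is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from
checked = pd.merge(pupils, travel, left_on="id", right_on="Student ID",
                   how="outer", validate="one_to_one", indicator=True)

# Rows present in only one of the two inputs
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

If `unmatched` is empty on the real data, the default inner join used in the cell above loses no pupils.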

**Remove the parental data fields; they aren't needed for this week's challenge**

In [5]:
data = data.drop(columns=["Parental Contact Name_1", "Parental Contact Name_2", "Preferred Contact Employer", "Parental Contact"])

**Change the weekdays from separate columns to one column of weekdays and one of the pupil's travel choice**

In [6]:
data = pd.melt(data, id_vars=['Student ID','pupil first name','pupil last name','gender','Date of Birth'], value_vars=["M","Tu","W","Th","F"],
                   var_name="Weekday", value_name='Method of Travel')
data

Unnamed: 0,Student ID,pupil first name,pupil last name,gender,Date of Birth,Weekday,Method of Travel
0,1,Ronna,Nellies,Female,12/21/2013,M,Car
1,2,Rusty,Andriulis,Male,7/21/2012,M,Bicycle
2,3,Roberta,Oakeshott,Female,12/4/2011,M,Car
3,4,Lola,Rubinfajn,Male,6/29/2012,M,Scooter
4,5,Kamila,Benedtti,Female,7/10/2012,M,Bycycle
...,...,...,...,...,...,...,...
4995,996,Ninetta,Worling,Female,2/15/2015,F,WAlk
4996,997,Stanford,Tinton,Female,4/9/2013,F,Walk
4997,998,Ertha,MacCook,Male,12/14/2013,F,Bicycle
4998,999,Lawton,Randles,Female,12/12/2011,F,Walk


**Group the travel choices together to remove spelling mistakes**

In [7]:
data['Method of Travel'].value_counts()

Car                1586
Walk               1035
Bicycle             710
Scoter              252
Scooter             252
Walkk               176
Carr                169
Bycycle             162
Van                 162
Scootr               84
WAlk                 84
Wallk                84
Helicopter           72
Mum's Shoulders      46
Aeroplane            45
Dad's Shoulders      24
Helicopeter          24
Waalk                24
Skipped               3
Hopped                3
Jumped                3
Name: Method of Travel, dtype: int64

In [8]:
# Map each misspelling to its canonical spelling in a single pass
spelling_fixes = {'Walkk': 'Walk', 'Wallk': 'Walk', 'WAlk': 'Walk', 'Waalk': 'Walk',
                  'Carr': 'Car', 'Scoter': 'Scooter', 'Scootr': 'Scooter',
                  'Helicopeter': 'Helicopter', 'Bycycle': 'Bicycle'}
data['Method of Travel'] = data['Method of Travel'].replace(spelling_fixes)
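The hand-written replacements above work because the misspellings are known in advance. As an alternative sketch, the standard library's `difflib.get_close_matches` can map each raw value to its nearest canonical spelling. The helper name `normalise_method` and the `0.8` cutoff are assumptions chosen for illustration, not part of the challenge, and the cutoff would need checking against any new data.

```python
from difflib import get_close_matches

# Canonical spellings taken from the value counts above
CANONICAL = ["Car", "Walk", "Bicycle", "Scooter", "Van", "Helicopter",
             "Aeroplane", "Mum's Shoulders", "Dad's Shoulders",
             "Skipped", "Hopped", "Jumped"]
# Compare in lower case so that "WAlk" matches "Walk" exactly
_LOOKUP = {name.lower(): name for name in CANONICAL}


def normalise_method(raw, cutoff=0.8):
    """Return the closest canonical spelling, or the input unchanged if none is close."""
    matches = get_close_matches(raw.strip().lower(), list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[matches[0]] if matches else raw


for raw in ["Walkk", "WAlk", "Scoter", "Helicopeter", "Bycycle", "Carr"]:
    print(raw, "->", normalise_method(raw))
```

On the DataFrame this could be applied with `data['Method of Travel'].map(normalise_method)`. Fuzzy matching can merge values that are genuinely different, so it is worth re-running `value_counts()` afterwards to inspect the result.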

**Create a Sustainable (non-motorised) vs Non-Sustainable (motorised) data field** <br>
**Scooters are the child type rather than the motorised type**

In [9]:
sustainable = ['Walk', 'Bicycle', 'Scooter', 'Jumped', 'Skipped', 'Hopped', "Dad's Shoulders", "Mum's Shoulders"]
non_sustainable = ['Car', 'Van', 'Helicopter', 'Aeroplane']
data.loc[data['Method of Travel'].isin(sustainable), 'Sustainable?'] = 'Sustainable'
data.loc[data['Method of Travel'].isin(non_sustainable), 'Sustainable?'] = 'Non-Sustainable'


**Total up the number of pupils travelling by each method of travel**

In [10]:
# Count pupils per sustainability group, method and weekday
df = (data.groupby(["Sustainable?", "Method of Travel", "Weekday"])['Student ID']
          .count()
          .reset_index(name="Number of Trips"))


**Work out the % of trips taken by each method of travel each day <br>
Round to 2 decimal places**

In [11]:
# The melted data has 5,000 rows over 5 weekdays, i.e. 1,000 trips per day
df["Trips per day"] = 1000
df["% trips per day"] = (df["Number of Trips"] / df["Trips per day"]).round(decimals=2)
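The fixed total of 1000 holds only while every pupil has exactly one trip per weekday. A minimal sketch of deriving the daily total from the data instead is shown below on a small made-up frame (`trips` and its values are illustrative, not challenge data), using `groupby(...).transform("sum")` so each row carries its own weekday's total.

```python
import pandas as pd

# Hypothetical counts in the same shape as df above
trips = pd.DataFrame({
    "Method of Travel": ["Car", "Walk", "Car", "Walk"],
    "Weekday": ["M", "M", "Tu", "Tu"],
    "Number of Trips": [3, 1, 1, 1],
})

# Total trips per weekday, broadcast back onto every row of that weekday
trips["Trips per day"] = trips.groupby("Weekday")["Number of Trips"].transform("sum")

# Share of each day's trips taken by each method, rounded to 2 decimal places
trips["% trips per day"] = (trips["Number of Trips"] / trips["Trips per day"]).round(2)
print(trips)
```

With this approach the share stays correct if rows are filtered out or a pupil is missing a day, at the cost of one extra groupby.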

**Remove any unnecessary columns of data**

**Output the data**

In [12]:
df

Unnamed: 0,Sustainable?,Method of Travel,Weekday,Number of Trips,Trips per day,% trips per day
0,Non-Sustainable,Aeroplane,F,9,1000,0.01
1,Non-Sustainable,Aeroplane,M,9,1000,0.01
2,Non-Sustainable,Aeroplane,Th,9,1000,0.01
3,Non-Sustainable,Aeroplane,Tu,9,1000,0.01
4,Non-Sustainable,Aeroplane,W,9,1000,0.01
5,Non-Sustainable,Car,F,254,1000,0.25
6,Non-Sustainable,Car,M,422,1000,0.42
7,Non-Sustainable,Car,Th,302,1000,0.3
8,Non-Sustainable,Car,Tu,364,1000,0.36
9,Non-Sustainable,Car,W,413,1000,0.41


In [13]:
# Write the result to CSV without the index column
df.to_csv('PD 2022 Week 4 Output.csv', index=False)


In [14]:
print("Data Prepped!")

Data Prepped!
