# Data Wrangling
***********
The initial data set (li_dirtydata.csv) was gathered and structured manually using Microsoft Excel. To provide context for the origin of this dataset, I have outlined the work completed outside of this repository below.

<br>

Utilized Power Query to import data tables for seasons 1-9 from Love Island Wiki.
- Imported table columns: Name, Age, Hometown, Day Entered, Status.  

<br>

Inserted a new column "Season" in each table to identify the season each contestant appeared on.  
- Combined the data from all seasons into one table.  
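
The Season column and the merge were done in Power Query, outside this repository. For anyone who would rather stay in pandas, a rough equivalent might look like the sketch below. The `season_tables` dictionary and its rows are hypothetical placeholders, not the real wiki data.

```python
import pandas as pd

# Hypothetical per-season tables standing in for the imported wiki tables.
season_tables = {
    1: pd.DataFrame({"Name": ["Contestant A", "Contestant B"], "Age": [24, 27]}),
    2: pd.DataFrame({"Name": ["Contestant C"], "Age": [22]}),
}

# Tag each table with its season number, then stack them into one table.
combined = pd.concat(
    [table.assign(Season=season) for season, table in season_tables.items()],
    ignore_index=True,
)
print(combined)
```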

<br>

Used the Text to Columns feature to separate data and form new columns: 
- Hometown split into Hometown and Region.
- Status split into Status and Day Dumped. 
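
The splits above were done with Excel's Text to Columns. As a minimal pandas sketch of the same idea, assuming the original Hometown cells used a "City, Region" layout (an assumption, since the raw wiki format is not shown here) and using made-up values:

```python
import pandas as pd

# Hypothetical values; the "City, Region" layout is an assumption.
df = pd.DataFrame({"Hometown": ["Essex, England", "Cardiff, Wales"]})

# Split on the first comma only, then strip stray whitespace from both parts.
parts = df["Hometown"].str.split(",", n=1, expand=True)
df["Hometown"] = parts[0].str.strip()
df["Region"] = parts[1].str.strip()
print(df)
```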

<br>

Used the Find and Replace feature to format and clean data:
- Removed blank spaces from cells for consistency.
- Removed non-integers from Day Entered and Day Dumped columns.
- Normalized names containing accented characters to regular English characters for compatibility with pandas dataframes.
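
The name normalization was done by hand with Find and Replace. One possible way to do the same thing in Python is sketched below using the standard-library `unicodedata` module. The helper name `strip_accents` and the sample names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Zoë José"))  # Zoe Jose
```

Applied to a dataframe, this could be used as `data['Name'].map(strip_accents)`.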

<br>

Manually created and gathered data for the following columns: 
- Height, Hair Color, Eye Color, Ethnicity, Length of Stay, Original Cast, Casa Amor Addition, Unique Partners, and Finalist.

<br>

The resulting CSV file contains data on Love Island contestants from seasons 1-9. 

Please refer to the [Data Dictionary](linktoDataDictionary) for detailed descriptions of each data field and access to the data sources.

**********

### Import Data

In [None]:
import pandas as pd

In [None]:
import os
cwd = os.getcwd()

In [None]:
data = pd.read_csv(os.path.join(cwd, 'data', 'li_dirtydata.csv'))

In [None]:
data.head()

### Check for duplicates

In [None]:
#Select every row whose Name appears more than once (keep=False marks all occurrences)
duplicates = data[data.duplicated(subset='Name', keep=False)]

duplicates

Per the output in the cell above, Adam Collard appeared on the show in two separate seasons. I will retain both of his stays in the villa for this data set. <br>
Apart from this special case, there are no duplicates in our data. 
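
A repeated name is only a true duplicate if it also repeats within the same season. The sketch below shows one way to tell the two cases apart, assuming the "Season" column added during the Excel work is present. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the "Name" and "Season" column names mirror the dataset.
df = pd.DataFrame({
    "Name": ["Contestant A", "Contestant A", "Contestant B"],
    "Season": [4, 8, 4],
})

# Names that occur more than once, with every occurrence kept.
repeat_names = df[df.duplicated(subset="Name", keep=False)]

# True duplicates: the same name listed twice within one season.
true_duplicates = df[df.duplicated(subset=["Name", "Season"], keep=False)]

print(len(repeat_names), len(true_duplicates))
```

A returning contestant shows up in `repeat_names` but not in `true_duplicates`.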

### Missing Values 

In [None]:
#Sample of Season 1 and 2 rows, showing Casa Amor Addition as "Null"
no_casa = data[data["Casa Amor Addition"].isin(['Null'])]

no_casa.head()

Another noteworthy point is that Casa Amor was not introduced until Season 3 of the show. <br>
As shown above, Season 1 and 2 contestants have 'Null' as their Casa Amor Addition value. <br>
I will replace 'Null' with 'No' for consistency.

In [None]:
data['Casa Amor Addition'] = data['Casa Amor Addition'].replace('Null', 'No')
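
A quick check after the replacement can confirm that no unexpected values remain. The sketch below uses made-up rows, and it assumes the other values in this column are 'Yes' and 'No' (only 'No' is confirmed by the notebook).

```python
import pandas as pd

# Hypothetical rows; 'Yes' as the positive value is an assumption.
df = pd.DataFrame({"Casa Amor Addition": ["Null", "Null", "Yes", "No"]})

df["Casa Amor Addition"] = df["Casa Amor Addition"].replace("Null", "No")

# After the replacement only 'Yes' and 'No' should remain.
remaining = set(df["Casa Amor Addition"].unique())
assert remaining <= {"Yes", "No"}, f"Unexpected values: {remaining}"
print(df["Casa Amor Addition"].value_counts())
```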

### Status Column Edge Cases

In [None]:
#Create new dataframe where Status column values are not our desired values
status_check = data[~data['Status'].isin(['1','2','3','4','Dumped'])]

status_check

The contestants shown above have 'Walked' or 'Removed' in their "Status" column. <br>
'Walked' indicates the contestant chose to leave the villa, while 'Removed' indicates the contestant was cut from the show by producers. <br>
I am going to replace these values with 'Dumped' for consistency. This ensures that all contestants who did not make it to the finale are identified by the same value in the "Status" column.

In [None]:
data['Status'] = data['Status'].replace(['Walked', 'Removed'], 'Dumped')
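
To confirm the edge cases are gone, the same filter used to find them can be run again after the replacement. This is a minimal sketch on made-up rows, assuming the allowed values are the ones listed in the check above.

```python
import pandas as pd

# Hypothetical rows covering each kind of Status value.
df = pd.DataFrame({"Status": ["1", "Dumped", "Walked", "Removed", "3"]})

# Map both edge cases onto 'Dumped' in one call.
df["Status"] = df["Status"].replace({"Walked": "Dumped", "Removed": "Dumped"})

# Re-run the edge-case filter; it should now return no rows.
allowed = ["1", "2", "3", "4", "Dumped"]
leftover = df[~df["Status"].isin(allowed)]
print(leftover.empty)
```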

### Column Names

When creating this dataset in Excel, I was explicit in naming columns for the sake of clarity. <br>
Column names are shortened and simplified below for ease of use in analysis, processing, and modeling.

In [None]:
#Create dictionary of new column names
new_columns = {'Hair Color': 'Hair',
               'Eye Color': 'Eye',
               'Day Entered': 'Entered',
               'Day Dumped': 'Dumped',
               'Length Of Stay': 'Stay',
               'Original Cast': 'OG',
               'Casa Amor Addition': 'Casa',
               'Unique Partners': 'Couples'}

#Rename columns specified in dictionary above
data = data.rename(columns=new_columns)
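
One thing worth knowing is that `rename` silently skips dictionary keys that match no column, so a capitalization mismatch would leave a column unrenamed without any error. The sketch below shows one way to catch that. The column names are hypothetical and chosen to illustrate a mismatch.

```python
import pandas as pd

# Hypothetical columns; "Length of Stay" deliberately differs in case from the key below.
df = pd.DataFrame(columns=["Name", "Hair Color", "Length of Stay"])
new_columns = {"Hair Color": "Hair", "Length Of Stay": "Stay"}

# List every dictionary key that matches no existing column.
missing = [old for old in new_columns if old not in df.columns]
print(missing)  # ['Length Of Stay']
```

Alternatively, passing `errors='raise'` to `rename` makes pandas raise a `KeyError` for any key that is not found.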

In [None]:
#Export new data to csv
data.to_csv(os.path.join(cwd, 'data', 'li_cleandata.csv'), index=False)

In [None]:
#Check new csv file
check = pd.read_csv(os.path.join(cwd, 'data', 'li_cleandata.csv'))

check.head()

With the data wrangling and cleaning complete and the revised dataset saved, I can move on to the EDA notebook.