Welcome to Data Science in the Wild! In today's tutorial, we'll start by reviewing some basics in Python that we'll use along with several packages to load, clean, analyse, and plot a dataset.

1. Print statements and simple calculations

First, let's review the print() function by introducing ourselves. Fill in the following line of code with your information, and then click "Run."

In [None]:
print('Hello, world! My name is [[enter your name]] and I am a PhD student at [[enter your affiliation]].')

We can also use print statements for calculations. Let's have Python do some basic maths:

In [None]:
print(40/2)
print(40//2)
print(100*845)
print(5-8+4)
print(5-(8+4))

The final two expressions show us that Python follows the order of operations. We won't go over maths today, but remember this in your calculations.

Do you notice anything else about the outputs for these expressions?

2. Data types and values

Python stores information as strings, integers, and floats. 

-Strings are enclosed in quotation marks and are often used for textual data, like 'Hello, world!' in our first example. 
-Integers are whole numbers without decimal points. Integers can be positive or negative. 
-Floats, AKA "floating-point numbers," contain decimal points.

Sometimes, you can run into mixed variable type errors. Here's an example:

In [None]:
print('20' + 20)

To fix these errors, you can check the type of an input. You can also convert between data types.

In [None]:
print(type(20.0))
print(type(84500))
print(type('Hello, world!'))

print(int(20.0))
print(float(84500))
print(str(547))
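As a minimal sketch, the mixed-type error above can be resolved in two ways, depending on whether you want arithmetic or text. The variable names below are illustrative and not part of the original exercise.

```python
# Illustrative names; '20' + 20 raises a TypeError because a str and an int can't be added.
text_number = '20'
whole_number = 20

# Option 1: convert the string to an integer, then add the two numbers.
numeric_sum = int(text_number) + whole_number
print(numeric_sum)

# Option 2: convert the integer to a string, then join the two pieces of text.
joined_text = text_number + str(whole_number)
print(joined_text)
```

Which option is right depends on what the values mean: quantities usually call for Option 1, while labels or ID codes usually call for Option 2.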

In [None]:
my_float=89.76
my_int=140
my_str='Today is Friday.'

After storing a variable, we can reference it again by name. We can also combine stored variables with new information.

In [None]:
print('What day is today?' + my_str)
print('What day is today?' + ' ' + my_str)

my_float + my_int

3. Storing data: lists, dictionaries, and dataframes

There are many ways to store information in Python. A list stores multiple items in one variable. 

In [None]:
coffee_list=['espresso', 'latte', 'macchiato', 'mocha', 'americano']
print(coffee_list[2])
print(coffee_list[-1])

A dictionary stores information in key:value format. 

In [None]:
city_dict={'Tokyo':'Japan', 'Toronto':'Canada', 'Istanbul':'Turkey', 'Mexico City':'Mexico', 'Cairo':'Egypt'}
city_dict.update({'Paris':'France'})
print(city_dict)
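Once information is stored in key:value format, we can look a value up by its key. Here is a short sketch using a smaller copy of the dictionary; the 'Lima' lookup and its default text are made up for illustration.

```python
# A shorter copy of the dictionary from the cell above.
city_dict = {'Tokyo': 'Japan', 'Toronto': 'Canada', 'Cairo': 'Egypt'}

# Square brackets return the value for a key (and raise a KeyError if the key is missing).
tokyo_country = city_dict['Tokyo']
print(tokyo_country)

# .get() returns a default value instead of raising an error when the key is missing.
lima_country = city_dict.get('Lima', 'not in dictionary')
print(lima_country)
```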

We can also store information in a dataframe. For this, we'll need to call the DataFrame constructor in pandas, which builds a new dataframe from the data we pass to it.


In [None]:
import pandas as pd
shopping_list=['potatoes', 'carrots', 'apples', 'onions', 'tomatoes']
food_quantity=[3, 5, 2, 4, 10]
df=pd.DataFrame(shopping_list, food_quantity)
print(df)

The dataframe holds our information, but it still needs some work. Let's add an index and column titles.

In [None]:
df.reset_index(inplace = True, drop = False)
df=df.rename(columns={'index':'Quantity', 0:'Food'})
print(df)
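As an alternative sketch, we could build the same table from a dictionary, so that the column titles are set when the dataframe is created. The name df_alt is hypothetical, chosen so that we don't overwrite df.

```python
import pandas as pd

shopping_list = ['potatoes', 'carrots', 'apples', 'onions', 'tomatoes']
food_quantity = [3, 5, 2, 4, 10]

# Each dictionary key becomes a column title; each list becomes that column's values.
# pandas adds a default index (0, 1, 2, ...) automatically.
df_alt = pd.DataFrame({'Quantity': food_quantity, 'Food': shopping_list})
print(df_alt)
```

Both routes give a table with a 'Quantity' and a 'Food' column; this one just skips the reset and rename steps.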

Some other useful dataframe functions are as follows--try them on your own:

In [None]:
print(df.drop(columns='Food'))
    #removes the selected column

In [None]:
print(df.info())
    #column info, dimensions, types of data in df

In [None]:
print(df.iloc[1:3])
    #slices dataframe from index position 1 up until (but *not* including!) position 3-->AKA positions 1 & 2 only

In [None]:
print(df.head(2))
    #slices first 2 rows from dataframe

In [None]:
print(df.tail(2))
    #slices final 2 rows from dataframe

In [None]:
print(df['Food'].sort_values(ascending=False))
    #sorts the 'Food' column reverse alphabetically. also applies to numerical values in float and int data types

4. Conditionals, counters, comparisons, and loops



In [None]:
for drink in coffee_list:
    if drink=='latte':
        print("I'll have a latte.")
    else:
        print("I'll have a cup of tea.")

In [None]:
for key in city_dict:
    print(key + " is a city in " + city_dict[key] + ".")

In [None]:
integer_list=[20, 40, 60, 80, 100]
counter = 0

for i in integer_list:
    if i + 15 < 90:
        counter += 1
print(counter)

In [None]:
animals = ['cat', 'dog', 'ferret', 'goldfish', 'gecko', 'rabbit']

'tortoise' in animals

In [None]:
tortoise=False

if tortoise:
    print('I have a pet tortoise.')
elif not tortoise:
    print("I don't have a pet tortoise.")

These expressions can be useful for checking for information and errors in complex dataframes. We'll try some simple exercises using the df we created above.

In [None]:
fruits_and_veg=['bananas', 'apples', 'potatoes', 'onions']
counter=0

for i in df['Food']:
    if i in fruits_and_veg:
        counter += 1
    else:
        print("We don't need to buy any " + i + " this week.")
print("There are " + counter + " items on the shopping list this week.")



Can you diagnose the error above? How should we fix it?

In [None]:
print("There are " + str(counter) + " items on the shopping list this week.")
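Another way to avoid this kind of type error is an f-string, which converts values to text for us. This is a small sketch; the counter value below is a stand-in for the one computed in the loop above.

```python
counter = 3  # stand-in value for the counter computed in the loop above

# The braces insert the value of counter and convert it to text automatically.
message = f"There are {counter} items on the shopping list this week."
print(message)
```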

Part II: Wrangling Data

A. Importing and Exploring Data

We'll start by importing the packages needed for today's lesson. Then, we'll use pandas to load our data to a dataframe, which we'll call "df."

In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt

df=pd.read_csv('/Users/jessicawitte/DCODE_Data.csv', encoding='utf-8')

After you load a dataset, you can get a summary of its contents with .info(). This is a great feature to remember if you ever find yourself working with extremely large datasets that can't be loaded into Excel or Google Sheets.

.iloc[] allows you to print a "slice" of the dataframe by position. Positions start at 0, so in this example, df.iloc[5] prints the sixth row of the dataframe. You can also print ranges (e.g. df.iloc[5:10]), select a single row counting from the end (df.iloc[-5]), or slice the final rows (df.iloc[-5:]).

In [None]:
df.info()

In [None]:
df.iloc[5]
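To see the different kinds of .iloc[] selection side by side, here is a sketch on a toy dataframe. The data and variable names are made up for illustration and are not part of the tutorial dataset.

```python
import pandas as pd

# A toy dataframe with twelve rows holding the values 10 to 21 (illustrative data).
toy = pd.DataFrame({'value': range(10, 22)})

row = toy.iloc[5]               # one row: index position 5, i.e. the sixth row
middle = toy.iloc[5:8]          # positions 5, 6, and 7 (position 8 is excluded)
fifth_from_end = toy.iloc[-5]   # one row: the fifth row counting from the end
last_five = toy.iloc[-5:]       # the final five rows

print(row['value'], len(middle), fifth_from_end['value'], len(last_five))
```

A single position returns one row as a Series, while a range with a colon returns a smaller dataframe.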

From the output of these lines, what can you observe about the dataset? For instance:

1. How many rows and columns are present?
2. What does the "Non-Null Count" mean?
3. What types of data are in the columns?
4. Are the columns labelled? Are the labels appropriate?
5. Do you see any evidence of errors or gaps in the data?
6. Is the formatting uniform/standard?

Datasets are unique, and so the cleaning process can't be entirely standardized for every situation. The questions above can help you determine what needs to be adjusted in your dataset before you can move on to the next phase of your project.



We can print information about a column with .describe(). Note that the output differs depending on the dtype stored in the column:

*A quick note about "dtypes"*: dtypes are not *exactly* equivalent to "types" in Python. We won't get into specifics in this tutorial, but the differences have to do with how pandas and NumPy store the values of a whole column in memory.
Like type errors, dtype errors are very common, so you might need to convert between dtypes. dtypes int64 and float64 are similar to types int and float; the object dtype typically stores textual data, like type str.

In [None]:
df['Area of UK'].describe()

In [None]:
df['love-island-series'].describe()
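To see how the output of .describe() depends on what a column holds, here is a sketch on two toy columns. The values and column names are invented for illustration and are not taken from the tutorial dataset.

```python
import pandas as pd

# Toy columns (illustrative values only).
toy = pd.DataFrame({'age': [22, 25, 25, 31],
                    'city': ['Leeds', 'Cardiff', 'Leeds', 'Dundee']})

print(toy.dtypes)   # the dtype of each column

# Numerical column: count, mean, std, min, quartiles, and max.
numeric_summary = toy['age'].describe()
print(numeric_summary)

# Text column: count, number of unique values, most common value (top), and its frequency.
text_summary = toy['city'].describe()
print(text_summary)
```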

B. Cleaning Data

Now that we have an idea of what our data contains, let's do some basic cleaning. Cleaning processes will differ depending upon the next steps in our analysis. For instance, cleaning data for text mining often involves stemming and lemmatizing textual data. For our purposes, we're going to be plotting our data. This means that we need to standardize numerical data, perform simple calculations, remove irrelevant data, and check for errors.

First, let's check for any null or NaN values. We won't be able to graph numerical data that contains null values, so this step is important.

In [None]:
print(df.isnull().sum().sum())

df.isna().any()[lambda x: x]

#https://www.w3schools.com/python/python_lambda.asp is a great resource for more about lambda functions.
#In short, a lambda is a small, unnamed function written on one line. Because it has no name, you can't
#call it again elsewhere without rewriting it. Lambda functions are helpful for quick, one-off checks
#like this one, which keeps only the columns where .any() returned True.

So, there are 202 null values in six columns.
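The same checks can be sketched on a toy dataframe, where the counts are easy to verify by eye. The column names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with three missing values spread over two columns (illustrative only).
toy = pd.DataFrame({'dates': [3, np.nan, 1, np.nan],
                    'wins': [0, 2, np.nan, 1],
                    'name': ['Ann', 'Bo', 'Cy', 'Di']})

nulls_per_column = toy.isnull().sum()     # one count per column
total_nulls = nulls_per_column.sum()      # grand total, the same as .sum().sum()

# Keep only the titles of the columns that contain at least one null value.
columns_with_nulls = toy.columns[toy.isna().any()].tolist()

print(nulls_per_column)
print(total_nulls, columns_with_nulls)
```

Printing the per-column counts is often more informative than the grand total alone, because it shows where the gaps are concentrated.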


Let's try performing a calculation with null values.

In [None]:
df['Number of dates'].iloc[0:5].mean()
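As a sketch of what happens behind the scenes, pandas skips NaN values by default when it calculates a mean. The toy Series below is illustrative; it also shows that filling NaN with 0 changes the result, which is worth keeping in mind before replacing missing values.

```python
import numpy as np
import pandas as pd

dates = pd.Series([2, np.nan, 4])  # illustrative values

mean_skipping_nan = dates.mean()            # NaN is ignored: (2 + 4) / 2
mean_with_nan = dates.mean(skipna=False)    # NaN is kept, so the result is NaN
mean_after_fill = dates.fillna(0).mean()    # NaN counted as 0: (2 + 0 + 4) / 3

print(mean_skipping_nan, mean_with_nan, mean_after_fill)
```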

We can replace NaNs with 0s. 

In [None]:
df=df.replace(np.nan,0)
df.info()

Now, no null values exist. But look at the dtypes of the affected columns: pandas stores a numerical column that contains NaN as float64, and replacing the NaN values with 0 does not convert it back to int64.
Let's check the output of one of the rows and standardize our dtypes.

In [None]:
df.iloc[10]

In [None]:
for column in df.columns:
    if df[column].dtype == 'float64':
        df[column] = df[column].astype(np.int64)    

In [None]:
df.info()
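The dtype behaviour above can be reproduced on a toy Series. This is a minimal sketch with invented values, not the tutorial data.

```python
import numpy as np
import pandas as pd

challenges = pd.Series([1, np.nan, 3])  # illustrative values
print(challenges.dtype)   # float64, because NaN is itself a float

filled = challenges.replace(np.nan, 0)
print(filled.dtype)       # still float64 after the NaN is replaced

# astype() drops anything after the decimal point, so it is safest on
# columns that hold whole numbers only.
as_int = filled.astype('int64')
print(as_int.dtype)
```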

The column titles in our dataset could be improved. We'll also want to standardize the capitalization in our data. We'll capitalize the first letter of each word in our text (object) columns with .str.title().
(note: see more pandas functions related to capitalization here https://pandas.pydata.org/docs/reference/api/pandas.Series.str.capitalize.html)


In [None]:
def standard_caps(df):
    df = df.apply(lambda x: x.str.title() if x.dtype == "object" else x) 
    return df


In [None]:
df=standard_caps(df)
df.iloc[10]
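For comparison, here is a sketch of how .str.title() differs from .str.capitalize() on a toy Series. The job titles are invented for illustration.

```python
import pandas as pd

jobs = pd.Series(['personal trainer', 'CIVIL ENGINEER', 'Model'])  # illustrative values

title_case = jobs.str.title()         # first letter of every word is capitalized
capitalized = jobs.str.capitalize()   # only the first letter of the whole string

print(title_case.tolist())
print(capitalized.tolist())
```

Either method makes the capitalization uniform; which one suits a column depends on whether the values are names (title case) or ordinary phrases.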

Let's also address the capitalization in the column titles:

In [None]:
df.columns = map(str.title, df.columns)
df.info()

We can see what's in a column by printing df['column title'], but sometimes the output is too large to be displayed in Python. We can use .unique() to print one instance of each value in a column.

In [None]:
jobs=df['Profession'].unique()
print(jobs)

In [None]:
ages=df['Age'].unique()
print(ages)

In [None]:
cities=df['From'].unique()
print(cities)

Take a look at the output of "cities." Do you see any errors that we should address before moving on to our plots?

In [None]:
df.loc[df.isin(['Scotland']).any(axis=1)].index.tolist()
#Python looks for 'Scotland' anywhere in the dataframe and returns a list of the index labels of the rows that
#contain it. So, we know the error is in row 28.


In [None]:
df.iloc[28]
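When we already know which column to check, a narrower search is possible. This sketch uses a toy dataframe with invented rows; only the 'From' column title comes from our dataset.

```python
import pandas as pd

# Toy data (illustrative rows only).
toy = pd.DataFrame({'Name': ['Ann', 'Bo', 'Cy'],
                    'From': ['Leeds', 'Scotland', 'Cardiff']})

# Compare one column against the value, then keep the index labels where it matches.
matching_rows = toy.index[toy['From'] == 'Scotland'].tolist()
print(matching_rows)
```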

In a small dataset, it's easy to correct errors manually. In this case, we can try Google to learn where in Scotland the contestant was born. According to Wikipedia, she is from Dumfries. We can manually correct this in the dataframe.

In [None]:
df.loc[28, 'From'] = 'Dumfries'
df.iloc[28]

For our final step in cleaning today, we'll edit the column names so they're more concise.

In [None]:
df=df.rename(columns={"Day Left Villa": "Last Day", "Day Joined Villa": "First Day", 
                      "Was Longest Couple Final Couple":"Longest/Final Couple", 
                      "Love-Island-Series":"Series Year"})
df.info()

We have a list of the contestants' ages, and we can calculate statistics such as the standard deviation and mean computationally:

In [None]:
df['Age'].describe()

But for certain research questions, we need to know the frequency distribution of the data. We can get this with .value_counts(), which returns a frequency table of unique values as a pandas Series. With a few more lines of code, we can format the Series as a dataframe.

In [None]:
age_counts=df['Age'].value_counts()
#name the index 'Age' and the counts 'Count' explicitly, so the result is the same across pandas versions
df_ages=age_counts.rename_axis('Age').reset_index(name='Count')

print(df_ages)

C. Plotting Data

We're ready for our first plot, which we'll format as a simple barplot in Seaborn. (note: Seaborn, which runs on top of Matplotlib, can create very interesting visualizations for a variety of datasets. Here is a gallery: https://seaborn.pydata.org)

In [None]:
sns.barplot(data=df_ages, x="Age", y="Count")
plt.title('Ages of "Love Island" Contestants')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Barplots are good for categorical data (in other words, sorting data into categories and measuring the size, percentage, or quantity of each category). They are often confused with histograms, which instead group continuous numerical data into ranges, or bins. Even though we are organising our data by number, each age functions as a category in this case.
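To make the idea of categories concrete, here is a sketch contrasting two ways of counting ages: one count per distinct age (what our barplot shows) versus one count per range of ages (what a histogram would show). The ages and bin edges below are invented for illustration.

```python
import pandas as pd

ages = pd.Series([21, 22, 22, 24, 27, 29, 31])  # illustrative values

# Each distinct age is its own category, as in the barplot above.
per_age = ages.value_counts().sort_index()
print(per_age)

# Histogram-style grouping: ages are sorted into ranges (bins) first.
# By default each bin excludes its left edge and includes its right edge.
bin_edges = [20, 25, 30, 35]
per_bin = pd.cut(ages, bins=bin_edges).value_counts().sort_index()
print(per_bin)
```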

Let's try plotting numerical data. We'll use Seaborn to make a scatterplot to see whether the amount of time a contestant spent in the villa affected how many dates they got during the series. 

In [None]:
sns.relplot(x="Number Of Days In Villa", y="Number Of Dates", hue="Series Year", alpha=.5, palette="bright",
            height=6, data=df)
plt.title('Days vs. Dates')
plt.show()

We can also use the size of the scatter points to show another variable in Seaborn.

In [None]:
sns.relplot(x="Number Of Days In Villa", y="Number Of Dates", hue="Series Year", 
            size='Number Of Challenges Won', sizes=(40,400), alpha=.5, 
            height=6, data=df)
plt.title('Days vs. Dates')
plt.show()