# Data Import and Processing

Datasets are essential to any data science project! In general, the more data you have, the easier it is to identify relationships between features. However, the computer must be able to read a dataset before you can analyse it. The main objective of this exercise is therefore to equip you with the skills to import and process your dataset before any data analysis or machine learning is conducted.

## 1. Data Import

There are many websites from which you can obtain data for free. Some examples include Kaggle (https://www.kaggle.com/) and the University of California, Irvine (UCI) repository (https://archive.ics.uci.edu/ml/datasets.html/). We can manually download the datasets and place them in new folders on our computers. However, this can be time-consuming. Thus, here is a neat little trick to automate this process! The script is called magic.py. Try it out!

For the script to work, make sure you have the wget, pandas and matplotlib libraries installed in your Python virtual environment. The os module is part of the Python standard library and does not need to be installed separately.

If you encounter an error while running the cell below, make sure the first line is commented out: #%matplotlib qt

In [1]:
#%matplotlib qt
%run "magic.py"


ERROR:root:File `'magic.py'` not found.


Hooray! You have successfully downloaded the data and plotted a graph without any manual intervention. Without opening the magic.py file, are you able to deduce where the data was downloaded to? The printed statements above will provide some hint!

<font color=blue>Bonus: Does the figure look correct? Are you able to explain the negative values and the black lines on the x-axis?</font>

## 1.1 Downloading the Pokemon dataset

Now it is time to import a dataset on your own. We will use the Pokemon Image dataset. Please spend some time going through the dataset description before attempting the next set of instructions.

In [14]:
df = pd.read_csv(r'C:\Users\rakes\Downloads\pokemon.csv')
print(df.head())

FileNotFoundError: ignored

Create a new folder to store the dataset. Write code below to download the dataset automatically, using the urllib.request.urlretrieve function. You can use the code in magic.py as a reference.

To view the contents of magic.py, find the magic.py file in the folder, right-click on it and open it with WordPad.

Use the URL: "http://sl2files.sustainablelivinglab.org/PokeIMG.zip" and save it as "PokeIMG.zip"

<font color=blue>Bonus: Download the data using only 2 lines of code!</font>

Well done! We now have our data downloaded! **Make sure to extract the files from the zip file into the directory!**
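For reference, here is a minimal sketch of how the download and extraction steps could be combined. It is not the code from magic.py; the function name `download_and_extract` and the folder name `PokeIMG` are assumptions made for illustration. It uses only `os.makedirs`, `urllib.request.urlretrieve` and `zipfile.ZipFile` from the standard library.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url, folder, filename):
    """Download `url` into `folder` as `filename`, then unzip it there.

    Hypothetical helper for illustration; not taken from magic.py.
    """
    # Create the target folder if it does not exist yet
    os.makedirs(folder, exist_ok=True)
    zip_path = os.path.join(folder, filename)

    # urlretrieve copies the resource at `url` to the local path
    urllib.request.urlretrieve(url, zip_path)

    # Extract every file in the archive into the same folder
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(folder)
    return zip_path


# Example call (performs a network download, so it is left commented out).
# "PokeIMG" is an assumed folder name; choose your own.
# download_and_extract(
#     "http://sl2files.sustainablelivinglab.org/PokeIMG.zip",
#     "PokeIMG",
#     "PokeIMG.zip",
# )
```

Wrapping the steps in a function keeps the folder creation, download and extraction together, so the same helper can be reused for other dataset URLs.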


We will now access our data and learn some of its features. To do this, let’s explore a Python library called ‘pandas’!

## 1.2 Introduction to Pandas

Pandas is a powerful tool to import datasets. It organises data into an easily processed [dataframe](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python) which allows for easy statistical analysis. 

Read this [article](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673) and watch this [video](https://www.youtube.com/watch?v=dcqPhpY7tWk) for a quick introduction to pandas: what it is, what it is commonly used for, and how you can use it.

Pay careful attention to the part about importing data and viewing data, as we will use some of the functions in our exercises later!

Summarise what you learnt about Pandas in your worksheet.
-  How do you install and use Pandas?
-  What are the common types of files that Pandas is used for?
-  What is a dataframe?
-  How do you access the rows and columns in the dataframe?
-  Name and describe some commonly used Pandas functions.

ANSWER:
- conda install pandas or pip install pandas into the python virtual environment. Remember to import pandas in the notebook.
- Pandas can be used for both CSV files and Excel spreadsheets.
- Dataframes are 2D data structures that have rows and columns. Dataframes are similar to how data are presented in Excel spreadsheets.
- Rows and columns can be accessed through their names or their numbers. For example, dataframe['petal_size'] can be used to access data within the column that is labelled as "petal_size". Alternatively, dataframe.iloc[1] accesses data within the second row of the dataframe.
- dataframe.head(): Returns the top few rows of the dataframe
- dataframe.shape: Returns the dimensions of the dataframe (number of rows and columns)
- dataframe.fillna(): Fills missing values with given values
- dataframe.describe(): Returns basic statistics of the dataframe
- dataframe.info(): Returns the type of data within each column
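
The functions listed above can be tried on a small dataframe. The sketch below uses made-up values (the column names and numbers are hypothetical and not taken from any dataset in this exercise), purely to show what each call does.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
dataframe = pd.DataFrame({
    "petal_size": [1.4, None, 4.7, 5.1],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Access a column by its name
column = dataframe["petal_size"]

# Access the second row by its integer position (counting starts at 0)
second_row = dataframe.iloc[1]

# First two rows, and the (rows, columns) dimensions
print(dataframe.head(2))
print(dataframe.shape)

# Replace missing petal sizes with 0.0 (returns a new dataframe)
filled = dataframe.fillna({"petal_size": 0.0})
print(filled)

# Basic statistics for numeric columns, and the data type of each column
print(dataframe.describe())
dataframe.info()
```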

Now, let us use some functions within Pandas to help us access data. The first step is to import Pandas. Try importing pandas as pd.

In [4]:
import numpy as np
import pandas as pd


After importing Pandas, we will now try to read in the Iris Flower dataset. It is currently saved as a Comma Separated Values file (CSV). We will need to understand more about CSV files before we can access the data in them.

## 1.2.1 Comma Separated Values (CSV) files

Datasets are commonly stored in CSV files. A CSV file contains values that are separated by commas or by another delimiter character. For example, a CSV file containing names of people may be stored as John,Mary,Harry,Luke. The comma between the names tells the computer where one name ends and the next begins.
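
To see how the delimiter is used, here is a small sketch that parses the example line with Python's built-in `csv` module. The semicolon-separated variant is an added illustration of "other characters" and is not taken from any file in this exercise.

```python
import csv
import io

# The example line from the text, treated as the contents of a CSV file
text = "John,Mary,Harry,Luke\n"
rows = list(csv.reader(io.StringIO(text)))
print(rows)

# The same names separated by semicolons instead of commas;
# the delimiter must then be stated explicitly
semicolon_text = "John;Mary;Harry;Luke\n"
rows_semicolon = list(csv.reader(io.StringIO(semicolon_text), delimiter=";"))
print(rows_semicolon)
```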

CSV files usually have a .csv extension, but some use a different one. One example is the iris data.

See this [article](https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/) to find out more about CSV files: what they are and how to open them.

In [5]:
import pandas as pd

<font color=blue>Bonus: After understanding the nature of CSV files, how would one check whether the data file is a CSV file? Which python function can be used to do this?</font>
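
One possible approach to the bonus, offered as a sketch rather than the expected answer: the standard library's `csv.Sniffer` can inspect a sample of text and guess its delimiter. The helper name `guess_delimiter` and the list of candidate delimiters are assumptions made for illustration.

```python
import csv


def guess_delimiter(sample, candidates=",;\t|"):
    """Return the delimiter csv.Sniffer detects in `sample`, or None.

    Hypothetical helper; `candidates` restricts which characters are considered.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer could not find a consistent delimiter in the sample
        return None


# A small made-up sample with a consistent number of commas per line
sample = "a,b,c\n1,2,3\n4,5,6\n"
print(guess_delimiter(sample))
```

In practice you would read the first few lines of the data file into `sample`. A detected delimiter suggests, but does not prove, that the file is delimiter-separated.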

## 1.2.2 Pokemon image dataset

The Iris Flower dataset is a CSV file, even though it has the extension .data. Now, open the dataset using the pd.read_csv() function and assign the result to a variable df. Then, print out the first 5 rows of the dataframe to see what the data looks like. What do you notice?

In [9]:
df = pd.read_csv(r'C:\Users\rakes\Downloads\pokemon.csv')
print(df.head())

FileNotFoundError: ignored

In [None]:
df = pd.read_csv("pokemon.csv")
print(df.head(12))

NameError: ignored

Did you notice that the dataframe was missing headers (column names)? This happens because the original file does not contain a header row. As such, it is always important to find out more about a data file before using it. The required header name is 'Name'.

Now, let us try to include the names into the dataframe. It is necessary to read the data into the dataframe again to specify that the data has missing headers. This will allow us to add the names into the dataframe later. Fill in the blank of the missing header.

In [None]:
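# Answer
# header=None tells pandas that the first row is data, not column names
df = pd.read_csv("pokemon.csv", header=None)
print(df.head(10))
# Header name given in the text above
names = ['Name']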


NameError: ignored
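
The effect of `header=None` can be seen without the real file. The sketch below reads a tiny made-up, single-column CSV from memory (the contents are hypothetical and are not the actual pokemon.csv) and compares the default behaviour with `header=None` and with the `names` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical file contents with no header row
raw = "Bulbasaur\nIvysaur\nVenusaur\n"

# Default: pandas treats the first row as the header, so one data row is lost
default_df = pd.read_csv(io.StringIO(raw))
print(default_df)

# header=None: every row is data and columns are numbered 0, 1, ...
no_header_df = pd.read_csv(io.StringIO(raw), header=None)
print(no_header_df)

# names=[...]: supply the column names while reading
named_df = pd.read_csv(io.StringIO(raw), header=None, names=["Name"])
print(named_df)
```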

With the proper labels, you can now use pandas to obtain basic information (Number of rows and columns, type of data, number of missing values and basic statistics) about the dataset. Use .info() and .describe() to obtain basic information about the dataset!

Based on the information obtained, you should note that there are 150 different flowers in the dataset and that there are no missing values in the dataset.
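
As a rough illustration of reading this information off a dataframe, the sketch below uses a small made-up table (the column names and values are hypothetical). `isna().sum()` is shown as one additional way to count missing values per column.

```python
import pandas as pd

# Hypothetical data with one missing value, for illustration only
df_example = pd.DataFrame({
    "height": [0.7, 1.0, None, 0.6],
    "weight": [6.9, 13.0, 100.0, 8.5],
})

# Column types and non-null counts
df_example.info()

# Count, mean, std, min, quartiles and max for each numeric column
summary = df_example.describe()
print(summary)

# Number of missing values in each column
missing = df_example.isna().sum()
print(missing)
```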

You have now mastered the ability to download datasets automatically and import them using Pandas. Additionally, you have also learnt how to use the Pandas functions to obtain basic information about the dataset. Now we will proceed to a class activity where you will have to put all these skills to good use!