# Data Import and Processing

Datasets are essential to any data science project! The more data you have, the easier it generally is to identify relationships between features. However, the computer must be able to read and interpret a dataset before you can analyse it. The main objective of this exercise is therefore to equip you with the skills to import and process a dataset before any data analysis or machine learning is conducted.

## 1. Data Import

There are many websites from which you can obtain data for free. Some examples include Kaggle (https://www.kaggle.com/) and the University of California, Irvine (UCI) repository (https://archive.ics.uci.edu/ml/datasets.html). We can manually download the datasets and place them in new folders on our computers, but this can be time-consuming. Here is a neat little trick to automate the process! The script is named magic.py. Try it out!

For the script to work, make sure the wget, pandas and matplotlib libraries are installed in your Python virtual environment. The os module is part of the Python standard library and needs no separate installation.

If you encounter an error while running the cell below, please comment out the first line: #%matplotlib qt

In [4]:
# %matplotlib qt
# %run "magic.py"

Hooray! You have successfully downloaded the data and plotted a graph without any manual intervention. Without opening the magic.py file, can you deduce where the data was downloaded to? The printed statements above provide a hint!

#your answer here

<font color=blue>Bonus: Does the figure look correct? Are you able to explain the negative values and the black lines on the x-axis?</font>

#your answer here

## 1.1 Downloading the iris dataset

Now it is time to import a dataset on your own. The dataset to be used will be the famous Iris Flower dataset. The dataset can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/. The file to download is iris.data. More information on the dataset can be obtained from https://en.wikipedia.org/wiki/Iris_flower_data_set. Please spend some time going through the dataset description before attempting the next set of instructions.

Create a new folder to store the dataset. Write code below to download the dataset automatically, using the urllib.request.urlretrieve function. You can use the code in magic.py as a reference.

To view the contents of magic.py, find the file in the folder, right-click on it and open it with WordPad or any other text editor.

<font color=blue>Bonus: Download the data using only 2 lines of code!</font>

In [5]:
import urllib.request

urllib.request.urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "iris.data")

('iris.data', <http.client.HTTPMessage at 0x28d6290a0e0>)
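
The cell above saves iris.data into the current working directory. As a minimal sketch of the "new folder" step, the helper below creates the folder first and then downloads into it. The function name download_to_folder and its parameters are assumptions made for illustration; they are not taken from magic.py.

```python
import os
import urllib.request


def download_to_folder(url, folder, filename):
    """Download url into folder/filename, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    target = os.path.join(folder, filename)
    # urlretrieve returns a (local_path, headers) tuple; keep only the path
    local_path, _headers = urllib.request.urlretrieve(url, target)
    return local_path
```

For example, calling download_to_folder with the iris.data URL, a hypothetical folder name such as "datasets" and the filename "iris.data" should leave the file inside that folder. Later calls to pd.read_csv() would then need that path rather than plain "iris.data".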

Well done! We now have our data downloaded! We will now access our data and learn some of its features. To do this, let’s explore a Python library called ‘pandas’!

## 1.2 Introduction to Pandas

Pandas is a powerful tool for importing datasets. It organises data into a [dataframe](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python), a tabular structure that is easy to process and analyse statistically.

Read this [article](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673) and watch this [video](https://www.youtube.com/watch?v=dcqPhpY7tWk) for a quick introduction to pandas: what it is, what it can be used for, and how to use it.

Pay careful attention to the part about importing data and viewing data, as we will use some of the functions in our exercises later!

Summarise what you learnt about Pandas in your worksheet.
-  How do you install and use Pandas?
-  What are the common types of files that Pandas is used for?
-  What is a dataframe?
-  How do you access the rows and columns in the dataframe?
-  Name and describe some commonly used Pandas functions.

#your answer here

Now, let us use some functions within Pandas to help us access data. The first step is to import Pandas. Try importing pandas as pd.

In [1]:
import pandas as pd

After importing Pandas, we will now try to read in the Iris Flower dataset. It is currently saved as a Comma Separated Values (CSV) file. We need to understand more about CSV files before we can access the data in them.

## 1.2.1 Comma Separated Values (CSV) files

Datasets are often stored in CSV files. A CSV file contains values separated by commas or, sometimes, by another delimiter character such as a semicolon or a tab. For example, a CSV file containing names of people may be stored as John,Mary,Harry,Luke. The commas between the names tell the computer where one name ends and the next begins.
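
To make the role of the separator concrete, here is a small sketch that uses Python's built-in csv module to split the example line above into separate names. The variable names are chosen only for illustration.

```python
import csv
import io

line = "John,Mary,Harry,Luke"

# csv.reader expects a file-like object (or any iterable of lines),
# so wrap the string in io.StringIO
rows = list(csv.reader(io.StringIO(line)))
print(rows[0])

# A different separator, e.g. a semicolon, must be stated explicitly
rows_semicolon = list(csv.reader(io.StringIO("John;Mary;Harry;Luke"), delimiter=";"))
print(rows_semicolon[0])
```

Both print statements show the same list of four names; only the separator character in the input differs.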

The files usually have a .csv extension, but not always. One example is the iris data file, which has the extension .data.

See this [article](https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/) to find out more about CSV files: what are they, and how do you open them?

#your answer here

<font color=blue>Bonus: After understanding the nature of CSV files, how would one check whether the data file is a CSV file? Which python function can be used to do this?</font>

#your answer here

## 1.2.2 Iris Flower dataset

The Iris Flower dataset is a CSV file, even though it has the extension .data. Now, open the dataset using the pd.read_csv() function and assign the result to a variable df. Then, print the first 5 rows of the dataframe to see the data. What do you notice?

In [5]:
df = pd.read_csv("iris.data")
print(df.head(5))

   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
3  5.0  3.6  1.4  0.2  Iris-setosa
4  5.4  3.9  1.7  0.4  Iris-setosa


Did you notice that the dataframe is missing headers/column names? The original file has no header row, so pandas used the first row of data as the column names. This is why it is always important to find out more about a data file before using it.

We can find out more information in the iris.names file from the previous download [link](http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names). The iris.names file lists the column names under point 7 (Attribute Information). What are the required names?

#your answer here

You can refer to the picture (source: https://www.researchgate.net/figure/Trollius-ranunculoide-flower-with-measured-traits_fig6_272514310) below to understand the variables "sepal_length", "sepal_width", "petal_length" and "petal_width".

<img src = "./resources/PetalSepal1.png">

Now, let us try to include the names in the dataframe. We need to read the data into the dataframe again, this time specifying that the file has no header row. This will allow us to add the names to the dataframe afterwards. The code to add the first 2 names is shown below. Modify the code to include the other names before printing the top 5 rows again.

In [9]:
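df = pd.read_csv("iris.data", header=None)  # header=None: the file has no header row
print(df.head())  # at this point the columns are numbered 0 to 4
names = ["SL", "SW", "PL", "PW", "Class"]  # sepal length/width, petal length/width, class
df.columns = names  # replace the numeric labels with the names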

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa
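
As an alternative to assigning df.columns after reading, pd.read_csv() also accepts a names argument, so the labels can be supplied while the file is read. The sketch below assumes an in-memory sample of three rows, copied from the output above, in place of the real iris.data file so that it runs without the download. The variable names sample and sample_df are illustrative only.

```python
import io

import pandas as pd

# Three rows copied from the dataset, standing in for the iris.data file
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

names = ["SL", "SW", "PL", "PW", "Class"]

# header=None tells pandas the first line is data, not column names;
# names supplies the labels to use instead
sample_df = pd.read_csv(sample, header=None, names=names)
print(sample_df.head())
```

Either approach should give the same labelled dataframe; passing names simply saves the separate assignment step.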


With the proper labels, you can now use pandas to obtain basic information about the dataset: the number of rows and columns, the data types, the number of missing values and basic statistics. Use .info() and .describe() to obtain this information!

In [11]:
print(df.head())

# other important functions
print(df.info())
print(df.describe())

    SL   SW   PL   PW        Class
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SL      150 non-null    float64
 1   SW      150 non-null    float64
 2   PL      150 non-null    float64
 3   PW      150 non-null    float64
 4   Class   150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
               SL          SW          PL          PW
count  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.054000    3.758667    1.198667
std      0.828066    0.433594    1.764420    0.763161
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.350000    1

Based on the information obtained, you should note that there are 150 different flowers in the dataset and that there are no missing values in the dataset.
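
These two facts can also be checked directly rather than read off the .info() printout. The sketch below assumes a tiny hand-made dataframe (demo_df, with one deliberately missing value) instead of the iris data, purely to show what .shape and .isna().sum() return.

```python
import pandas as pd

# Small illustrative dataframe; None in a numeric column becomes NaN
demo_df = pd.DataFrame(
    {
        "SL": [5.1, 4.9, None],
        "SW": [3.5, 3.0, 3.2],
        "Class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

n_rows, n_cols = demo_df.shape  # (number of rows, number of columns)
missing = demo_df.isna().sum()  # missing values counted per column

print(n_rows, n_cols)
print(missing)
```

On the iris dataframe, df.shape should give (150, 5), matching the 150 entries and 5 columns reported by .info(), and every count from df.isna().sum() should be 0.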

You have now mastered the ability to download datasets automatically and import them using Pandas. Additionally, you have also learnt how to use the Pandas functions to obtain basic information about the dataset. Now we will proceed to a class activity where you will have to put all these skills to good use!