# UNIT 1. Import and Visualization of data with Python

This Unit includes some quick shortcuts for representing data in Python, closely following {cite:p}`kroese2020`.

NOTE: since this is a training tool, we assume the user has installed all the required packages. If a package is missing, install it separately from a terminal using your favorite package manager.

These notebooks have been tested on Ubuntu Linux, using Anaconda as the package manager in most cases. The notebooks are self-contained.
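As a quick check before running the notebooks, the following minimal sketch reports which packages cannot be imported. The list `required` is an assumption (only `pandas` is used explicitly in this Unit), and `missing_packages` is a hypothetical helper name, so adapt both to your environment.

```python
from importlib.util import find_spec

# Hypothetical list of required packages; adjust it to the notebooks you run.
required = ["pandas"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing can then be installed from a terminal with your package manager.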

## Retrieving data

Typically, the files containing the data to be analyzed are stored in comma-separated values (CSV) format. To work with them, the first step is to download the data. Before doing so, we make sure there is a proper folder to download the file into.

In [47]:
import os

directory = "datasets"

# Check if the directory exists
if not os.path.exists(directory):
    # If it doesn't exist, create it
    os.makedirs(directory)
else:
    print('folder ',directory,' exists')

# now use wget to download the file into the datasets folder
!wget -P $directory -nc https://archive.ics.uci.edu/static/public/109/wine.zip 

folder  datasets  exists
File ‘datasets/wine.zip’ already there; not retrieving.
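The `!wget` line only works where `wget` is installed. As a possible alternative, the same download-if-missing logic can be sketched with the standard library alone. The function name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def download_if_missing(url, folder):
    """Download url into folder unless the file already exists; return its path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    target = os.path.join(folder, os.path.basename(url))
    if os.path.exists(target):
        print("file", target, "already exists, not downloading")
    else:
        urllib.request.urlretrieve(url, target)
    return target
```

For the wine data, the call would be `download_if_missing('https://archive.ics.uci.edu/static/public/109/wine.zip', 'datasets')`.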



Now we extract the contents of the zip file with the `zipfile` module and read the data with the `pandas` module. Information about the data can be found in the [ML repository at UCI](https://archive.ics.uci.edu/dataset/109/wine).

In [48]:
import pandas as pd
from zipfile import ZipFile

with ZipFile('datasets/wine.zip', 'r') as f:
    # extract only the needed files into the datasets folder
    f.extractall(directory, members=['wine.names', 'wine.data'])

wine = pd.read_csv('datasets/wine.data',header=None)
wine.head()

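| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |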
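Extracting the archive to disk is not strictly required: a zip member can be opened as a file-like object and passed to `pd.read_csv`. The sketch below builds a tiny in-memory archive from the first two rows shown above so that it runs on its own. The names `sample`, `buffer` and `wine_sample` are illustrative only.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Two rows copied from the wine data shown above, used as stand-in content.
sample = (
    "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050\n"
)

# Build a small zip archive in memory so the example is self-contained.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("wine.data", sample)

# Open the member as a file-like object and hand it to pandas directly.
with ZipFile(buffer, "r") as archive:
    with archive.open("wine.data") as member:
        wine_sample = pd.read_csv(member, header=None)

print(wine_sample.shape)
```

With the real file, `buffer` would be replaced by the path `'datasets/wine.zip'`.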


As we do not have the names of the columns, we can manually assign them based on the information available at the web link above.

In [49]:
wine.columns = ['class','Alcohol','Maliacid','Ash','Alcalinity_of_ash','Magnesium','Total_phenols','Flavanoids','Nonflavonoid_phenols','Proanthocyanins','Color_intensity','Hue','0D280_0D315_of_diluted_wines','Proline']
wine.head()

| | class | Alcohol | Maliacid | Ash | Alcalinity_of_ash | Magnesium | Total_phenols | Flavanoids | Nonflavonoid_phenols | Proanthocyanins | Color_intensity | Hue | 0D280_0D315_of_diluted_wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
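The column names can also be supplied at read time through the `names` parameter of `pd.read_csv`, which avoids the separate assignment step. The sketch below uses two rows of the wine data in an in-memory buffer instead of the downloaded file. The names `column_names`, `rows` and `wine_sample` are illustrative.

```python
import io

import pandas as pd

# Same column names as assigned above.
column_names = ['class', 'Alcohol', 'Maliacid', 'Ash', 'Alcalinity_of_ash',
                'Magnesium', 'Total_phenols', 'Flavanoids',
                'Nonflavonoid_phenols', 'Proanthocyanins', 'Color_intensity',
                'Hue', '0D280_0D315_of_diluted_wines', 'Proline']

# Two rows copied from the wine data, standing in for 'datasets/wine.data'.
rows = (
    "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050\n"
)

# header=None says the file has no header row; names supplies the labels.
wine_sample = pd.read_csv(io.StringIO(rows), header=None, names=column_names)
print(wine_sample.columns.tolist()[:3])
```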


Alternatively, we can read the CSV directly from its URL, without downloading it first. We will use [Fisher's `iris` data](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/iris.html). Here the file already includes the names of the features, so we read them directly from the dataset.


In [50]:
dataname = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv'

iris = pd.read_csv(dataname)
iris.head()

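| | Unnamed: 0 | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |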


As the first column is a duplicate of the index, we can remove it.

In [53]:
# drop the column by its label (note the space in 'Unnamed: 0')
iris = iris.drop(columns='Unnamed: 0')
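The duplicated index column is a column, so it has to be dropped with `columns=` (or `axis=1`) and its exact label, `'Unnamed: 0'`, including the space. It can also be avoided at read time with `index_col=0`. The sketch below compares both options on the first two rows of the iris data. The empty first header field in `raw` is an assumption about the raw file, chosen to match the `Unnamed: 0` label that pandas displays.

```python
import io

import pandas as pd

# First rows of the iris file; the empty first header field is assumed.
raw = (
    ",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "1,5.1,3.5,1.4,0.2,setosa\n"
    "2,4.9,3.0,1.4,0.2,setosa\n"
)

# Option 1: read everything, then drop the extra column by its label.
dropped = pd.read_csv(io.StringIO(raw)).drop(columns='Unnamed: 0')

# Option 2: use the first column as the row index while reading.
indexed = pd.read_csv(io.StringIO(raw), index_col=0)

print(dropped.columns.tolist())
print(indexed.columns.tolist())
```

Both options leave the same five feature columns. They differ only in the row index, which starts at 0 in the first case and is taken from the file in the second.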