# \[Practice\] Lesson 1: Introduction to Data Analysis

Contents:
1. Data types
2. Install and import supporting packages for data analysis
3. Importing data
4. Exporting data

## 1. Data types

### 1.1. Text: str

In [1]:
greeting = "Hello from the other side"
type(greeting)

str

In [2]:
# Strings can be indexed: position 0 is the first character, -1 is the last
print("greeting[0]: %s" % greeting[0])
print("greeting[-1]: %s" % greeting[-1])
print("Length of the string: %d" % len(greeting))

greeting[0]: H
greeting[-1]: e
Length of the string: 25
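
Indexing with a single position returns one character, while slicing with `start:stop:step` returns a substring. The short sketch below reuses the `greeting` string from the lesson; the variable names `first_word`, `last_word`, `every_other`, and `reversed_text` are illustrative only.

```python
greeting = "Hello from the other side"

# A slice is written start:stop[:step]; the stop position is excluded
first_word = greeting[0:5]      # characters at positions 0..4
last_word = greeting[-4:]       # the last four characters
every_other = greeting[::2]     # every second character
reversed_text = greeting[::-1]  # a negative step walks backwards

print(first_word)  # Hello
print(last_word)   # side
print(every_other)
print(reversed_text)
```

Slicing never modifies the original string; each slice is a new `str` object.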


### 1.2. Numeric: int, float, complex

In [3]:
a = 5
b = 5.0
c = 5.5
z = complex(3,7)
print('Type of a: %s' % type(a))
print('Type of b: %s' % type(b))
print('Type of c: %s' % type(c))
print('Type of z: %s' % type(z))

Type of a: <class 'int'>
Type of b: <class 'float'>
Type of c: <class 'float'>
Type of z: <class 'complex'>


In [4]:
c = 10 - 4.0
print(c)
print("Type of c: %s" % type(c))

6.0
Type of c: <class 'float'>


In [5]:
print("The real part of z is: ", end="")
print(z.real)
print("The imaginary part of z is: ", end="")
print(z.imag)

The real part of z is: 3.0
The imaginary part of z is: 7.0
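
The `10 - 4.0` cell shows that mixing an `int` with a `float` produces a `float`. A minimal sketch of a few related rules follows; the variable names are illustrative and not part of the lesson.

```python
a = 7
b = 2

true_div = a / b     # "/" always returns a float, even for two ints
floor_div = a // b   # "//" keeps the int type when both operands are ints
mixed = a + 2.0      # int combined with float is promoted to float
as_int = int(6.9)    # int() truncates toward zero, it does not round

print(true_div, type(true_div))
print(floor_div, type(floor_div))
print(mixed, type(mixed))
print(as_int)
```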


### 1.3. Sequence: list, tuple, range

In [6]:
A = [1,5,"a", "hello"]
B = (1,5,8,"a")
C = range(5,10)
print("Type of A: %s" % type(A))
print("Type of B: %s" % type(B))
print("Type of C: %s" % type(C))

Type of A: <class 'list'>
Type of B: <class 'tuple'>
Type of C: <class 'range'>


In [7]:
# A list can hold elements of any type, including duplicates and even other lists
A1 = [A, "b", 'b']
A1

[[1, 5, 'a', 'hello'], 'b', 'b']

In [8]:
# Length of list
print("Length of A: %d" % len(A))
print("Length of A1: %d" % len(A1))

Length of A: 4
Length of A1: 3


In [9]:
# Indexing works the same way for list, tuple, and range
print(A[0])
print(B[-1])
print(C[-2])

1
a
8


In [10]:
# A list is mutable; a tuple and a range are immutable
A[1] = "replace"
print("A: ", end="")
print(A)
# The next line raises a TypeError
B[0] = 1

A: [1, 'replace', 'a', 'hello']


TypeError: 'tuple' object does not support item assignment

In [11]:
# The next line raises a TypeError
C[0] = 2

TypeError: 'range' object does not support item assignment
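
When the contents of a tuple or a range do need to change, one common approach is to copy them into a list, edit the list, and convert back if required. The sketch below assumes the same `B` and `C` values as the lesson; `B_list`, `B_new`, and `C_list` are illustrative names.

```python
B = (1, 5, 8, "a")
C = range(5, 10)

# Copy the tuple into a list, change it, then build a new tuple
B_list = list(B)
B_list[0] = 100
B_new = tuple(B_list)

# A range can be materialized as a list in the same way
C_list = list(C)
C_list[0] = 2

print(B)      # the original tuple is unchanged
print(B_new)
print(C_list)
```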

### 1.4. Mapping: dict

In [12]:
A_dict = dict({"a": 1, "b": 5, "c": 3})
print("A_dict: ", end="")
print(A_dict)
print("Keys: ", end="")
print(A_dict.keys())
print("Values: ", end="")
print(A_dict.values())

A_dict: {'a': 1, 'b': 5, 'c': 3}
Keys: dict_keys(['a', 'b', 'c'])
Values: dict_values([1, 5, 3])


In [13]:
# A dictionary is mutable
A_dict["d"] = 4
A_dict["a"] = 0
print("A_dict: ", end="")
print(A_dict)

A_dict: {'a': 0, 'b': 5, 'c': 3, 'd': 4}
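
Beyond assignment, a few other dictionary operations come up often in data analysis: safe lookup, membership tests, iteration, and removal. This is a minimal sketch that starts from the final state of `A_dict`; the names `missing`, `has_b`, `total`, and `removed` are illustrative.

```python
A_dict = {"a": 0, "b": 5, "c": 3, "d": 4}

# .get() returns a default instead of raising KeyError for a missing key
missing = A_dict.get("z", -1)

# "in" tests whether a key exists
has_b = "b" in A_dict

# .items() yields (key, value) pairs
total = 0
for key, value in A_dict.items():
    total += value

# .pop() removes a key and returns its value
removed = A_dict.pop("d")

print(missing, has_b, total, removed)
print(A_dict)
```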


### 1.5. Set

In [14]:
A = set((1,5,5,6))
type(A)
print(A)
B = [5,5,6,3,2,3]
print(set(B))

{1, 5, 6}
{2, 3, 5, 6}
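
Sets drop duplicates, and they also support the usual set algebra. The sketch below uses the two sets printed above under the illustrative names `X` and `Y`. Sets are unordered, so it uses `sorted()` when a stable order matters.

```python
X = {1, 5, 6}
Y = {2, 3, 5, 6}

union = X | Y    # elements in either set
common = X & Y   # elements in both sets
only_x = X - Y   # elements in X that are not in Y

# sorted() returns a list, which has a well-defined order
sorted_unique = sorted(set([5, 5, 6, 3, 2, 3]))

print(union)
print(common)
print(only_x)
print(sorted_unique)
```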


### 1.6. Boolean

In [15]:
a = True
type(a)

bool

In [16]:
b = 1 != 1
print(b)

False
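
Booleans usually come from comparisons, and they can be combined with `and`, `or`, and `not`. A small sketch, with illustrative variable names, of how they behave:

```python
x = 5

in_range = 0 < x < 10            # comparisons can be chained
both = in_range and x % 2 == 1   # True only if both conditions hold
empty_is_false = bool([])        # empty containers count as False
count_true = sum([True, False, True])  # True counts as 1, False as 0

print(in_range, both, empty_is_false, count_true)
```

Because `True` and `False` behave like 1 and 0 in arithmetic, summing a list of booleans counts how many conditions are met.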


## 2. Install and import supporting packages for data analysis

![packages](../images/packages.png)

Today, we will install pandas, numpy, matplotlib, and seaborn. To install a package, you can use ```pip``` or ```conda```:
```bash
pip install <package name>
conda install <package name>
```

Import these packages for use:

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 3. Importing data

- From local files: text, table
- Request from API

In [18]:
import os
DATA = "../data"

#### Read table-like data

In [19]:
# Use pandas to read delimited text files (.txt, .csv, .tsv); .xlsx files need pd.read_excel() instead
iris = pd.read_table(os.path.join(DATA, "iris.data.txt"), sep=",")
iris.head()

| | SepalLength | SepalWidth | PetalLength | PetalWidth | TrainingClass |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [20]:
type(iris)

pandas.core.frame.DataFrame

If your file has the extension ```.csv``` or ```.xlsx```, use ```pd.read_csv()``` or ```pd.read_excel()```, respectively. Homework: find a ```.csv``` or ```.xlsx``` file and try importing it into Python.
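
To try ```pd.read_csv()``` without a file on disk, the text can be wrapped in ```io.StringIO```, which pandas accepts in place of a path. This is a minimal sketch: the two data rows are copied from the iris preview above, and `csv_text` and `sample` are illustrative names.

```python
import io

import pandas as pd

# In-memory stand-in for a small .csv file (first line is the header)
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,TrainingClass
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

# read_csv uses a comma as the default separator
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

With a real file, pass its path instead of the `StringIO` object. Note that ```pd.read_excel()``` typically needs an extra engine package such as `openpyxl` to read ```.xlsx``` files.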

#### Read non-tabular data

In [21]:
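# The commented-out line below fails, presumably because the rows of cv.txt
# do not all have the same number of fields
# cv = pd.read_table(os.path.join(DATA, "cv.txt"), sep=",")
nodes = pd.DataFrame(columns=["ID", "Name", "Image"])
edges = pd.DataFrame(columns=["ID", "Arg1 ID", "Arg2 ID", "Color code"])
# Read the file line by line; the with-statement closes it when the loop ends
with open(os.path.join(DATA, "cv.txt"), "r") as f:
    for line in f:
        # Each line is tab-separated: 3 fields describe a node, anything else an edge
        sep = line.strip("\n").split("\t")
        if len(sep) == 3:
            nodes.loc[len(nodes), :] = sep
        else:
            edges.loc[len(edges), :] = sep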

In [22]:
nodes

| | ID | Name | Image |
|---|---|---|---|
| 0 | N1 | Hoa Nguyen | Gau |
| 1 | N2 | Education | education |
| 2 | N3 | Luong The Vinh high school for the gifted | LTV |
| 3 | N4 | University of Medicine and Pharmacy, Ho Chi Mi... | UMPHCM |
| 4 | N5 | VietAI | VietAI |
| 5 | N6 | Research Experience | research |
| 6 | N7 | Online Research Club | ORC |
| 7 | N8 | Working Experience | working |
| 8 | N9 | Cao Thang Eye Hospital | CTEH |
| 9 | N10 | Hobbies | hobby |


In [23]:
edges

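| | ID | Arg1 ID | Arg2 ID | Color code |
|---|---|---|---|---|
| 0 | E1 | N1 | N2 | 0B806C |
| 1 | E2 | N2 | N3 | 0B806C |
| 2 | E3 | N2 | N4 | 0B806C |
| 3 | E4 | N2 | N4 | 0B806C |
| 4 | E16 | N2 | N5 | 0B806C |
| 5 | E5 | N1 | N6 | 0B3080 |
| 6 | E6 | N6 | N4 | 0B3080 |
| 7 | E7 | N6 | N7 | 0B3080 |
| 8 | E8 | N1 | N8 | 805B0B |
| 9 | E9 | N8 | N5 | 805B0B |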
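
The loop that fills `nodes` and `edges` adds one row at a time with `.loc`, which is fine for a small file but can get slow as a DataFrame grows. A common alternative is to collect the rows in plain lists first. The sketch below shows only the parsing step. It assumes that `cv.txt` is tab-separated, with 3 fields per node and 4 per edge, and uses an in-memory `sample_text` as a stand-in for the file; `node_rows` and `edge_rows` are illustrative names.

```python
import io

# Stand-in for cv.txt, using rows shown in the tables above
sample_text = (
    "N1\tHoa Nguyen\tGau\n"
    "N2\tEducation\teducation\n"
    "E1\tN1\tN2\t0B806C\n"
)

node_rows, edge_rows = [], []
with io.StringIO(sample_text) as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            node_rows.append(fields)
        elif len(fields) == 4:
            edge_rows.append(fields)
        # lines with any other field count (e.g. blank lines) are skipped

print(node_rows)
print(edge_rows)
```

Each list can then be turned into a DataFrame in one step, for example `pd.DataFrame(node_rows, columns=["ID", "Name", "Image"])`.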


## 4. Exporting data

You can save your preprocessed data for future analysis.

In [24]:
nodes.to_csv("../output/nodes.tsv", sep="\t")
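
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If the index carries no information, passing `index=False` leaves it out. The sketch below illustrates the difference with a small hypothetical frame, `nodes_demo`, and writes to a string instead of a file so it runs anywhere.

```python
import io

import pandas as pd

# Small stand-in for the nodes table
nodes_demo = pd.DataFrame(
    {"ID": ["N1", "N2"], "Name": ["Hoa Nguyen", "Education"], "Image": ["Gau", "education"]}
)

# Without a path, to_csv returns the text it would have written
with_index = nodes_demo.to_csv(sep="\t")
without_index = nodes_demo.to_csv(sep="\t", index=False)

# Read both versions back to compare their columns
back_with = pd.read_csv(io.StringIO(with_index), sep="\t")
back_without = pd.read_csv(io.StringIO(without_index), sep="\t")

print(list(back_with.columns))
print(list(back_without.columns))
```

When writing to a real path such as `../output/nodes.tsv`, the target directory has to exist beforehand; `os.makedirs("../output", exist_ok=True)` is one way to make sure it does.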