# Python libraries for data analysis

Before we can use a library, we first have to import it:

In [None]:
import numpy as np
import pandas as pd
import sklearn

Notice how we can give an imported module a shorter alias with `as`. This is especially handy for longer names. You can also import individual pieces of a module:

In [None]:
from math import pi
print(pi)

3.141592653589793


In this case, we imported a constant. We can import functions, classes, and other objects in the same way.
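
As a small illustration of importing functions rather than a constant, the sketch below pulls `sqrt` and `floor` from the standard `math` module. The variable names `root` and `whole` are only chosen for this example.

```python
from math import sqrt, floor

# sqrt and floor can now be called without the "math." prefix
root = sqrt(16)
whole = floor(3.7)

print('Square root of 16:', root)
print('3.7 rounded down:', whole)
```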

## NumPy

NumPy can be used for many different things, most notably for manipulating arrays.

In [None]:
# Create arrays/matrices filled with zeros (np.zeros, unlike np.empty, sets every entry to 0)
empty_array = np.zeros(5)

empty_matrix = np.zeros((5,2))

print('Empty array: \n',empty_array)
print('Empty matrix: \n',empty_matrix)

Empty array: 
 [0. 0. 0. 0. 0.]
Empty matrix: 
 [[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [None]:
# Create matrices
mat = np.array([[1,2,3],[4,5,6]])
print('Matrix: \n',mat)
print('Transpose: \n',mat.T)
print('Item 2,2: ',mat[1,1])
print('Item 2,3: ',mat[1,2])
print('#rows and #columns: ',np.shape(mat))
print('Sum total matrix: ',np.sum(mat))
print('Sum row 1: ',np.sum(mat[0]))
print('Sum row 2: ',np.sum(mat[1]))
print('Sum column 3: ',np.sum(mat,axis=0)[2])

Matrix: 
 [[1 2 3]
 [4 5 6]]
Transpose: 
 [[1 4]
 [2 5]
 [3 6]]
Item 2,2:  5
Item 2,3:  6
#rows and #columns:  (2, 3)
Sum total matrix:  21
Sum row 1:  6
Sum row 2:  15
Sum column 3:  9
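
The `axis` argument used for the column sum above can be confusing at first. The following sketch, which rebuilds the same small matrix, shows one way to read it: `axis=0` sums down the rows (one result per column) and `axis=1` sums across the columns (one result per row). The names `col_sums` and `row_sums` are chosen for this example only.

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one sum per column
col_sums = np.sum(mat, axis=0)

# axis=1 collapses the columns, giving one sum per row
row_sums = np.sum(mat, axis=1)

print('Column sums:', col_sums)
print('Row sums:', row_sums)
```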


## pandas

pandas is great for reading and creating datasets, as well as performing basic operations on them.

In [None]:
# Creating a list with three rows of data
data = [['johannes',10],['giovanni',2],['john',3]]

# Creating and printing a pandas DataFrame object from the list
df = pd.DataFrame(data)
print(df)

          0   1
0  johannes  10
1  giovanni   2
2      john   3


In [None]:
# Naming the columns of the DataFrame object
df.columns = ['names','years']
print(df)

      names  years
0  johannes     10
1  giovanni      2
2      john      3


In [None]:
# Selecting a single column (a pandas Series) and calculating its sum
# Printing the Series also shows the type of its values: 64-bit integers (int64)
print(df['years'])
print('Sum of all values in column: ',df['years'].sum())

0    10
1     2
2     3
Name: years, dtype: int64
Sum of all values in column:  15


In [None]:
# Creating a larger list of rows
data = [['johannes',10],['giovanni',2],['john',3],['giovanni',2],['john',3],['giovanni',2],['john',3],['giovanni',2],['john',3],['johannes',10]]

# Again, creating a DataFrame object, now with columns
df = pd.DataFrame(data, columns = ['names','years'])

# Print the 5 first (head) and 5 last (tail) observations
print(df.head())
print('\n')
print(df.tail())

      names  years
0  johannes     10
1  giovanni      2
2      john      3
3  giovanni      2
4      john      3


      names  years
5  giovanni      2
6      john      3
7  giovanni      2
8      john      3
9  johannes     10


In [None]:
# Print all unique values of the column names
print(df['names'].unique())

['johannes' 'giovanni' 'john']


In [None]:
# Add a column named 'code' with all zeros
df['code'] = np.zeros(10)
print(df)

      names  years  code
0  johannes     10   0.0
1  giovanni      2   0.0
2      john      3   0.0
3  giovanni      2   0.0
4      john      3   0.0
5  giovanni      2   0.0
6      john      3   0.0
7  giovanni      2   0.0
8      john      3   0.0
9  johannes     10   0.0


You can also easily select rows and columns of a DataFrame using ```.loc```:

In [None]:
# Rows with labels 2 to 5 (the end label is included) and all columns:
print(df.loc[2:5, :])

      names  years  code
2      john      3   0.0
3  giovanni      2   0.0
4      john      3   0.0
5  giovanni      2   0.0


In [None]:
# Rows 2 to 4 and columns selected by name (passed as a list):
print(df.loc[2:4, ['years', 'code']])

   years  code
2      3   0.0
3      2   0.0
4      3   0.0
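
Because `.loc` selects by label, its slices include the end label, which differs from ordinary Python slicing. The sketch below contrasts it with the position-based `.iloc` and also shows selection with a boolean condition. The small `people` DataFrame and the other variable names are made up for this illustration.

```python
import pandas as pd

data = [['johannes', 10], ['giovanni', 2], ['john', 3], ['giovanni', 2]]
people = pd.DataFrame(data, columns=['names', 'years'])

# .loc slices by label and includes the end label: rows 1, 2 and 3
by_label = people.loc[1:3, :]

# .iloc slices by position and excludes the end position: rows 1 and 2
by_position = people.iloc[1:3, :]

# A boolean condition keeps only the rows where it is True
long_stays = people.loc[people['years'] > 2, ['names', 'years']]

print(by_label)
print(by_position)
print(long_stays)
```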


In [None]:
# You can count how often each value occurs in a column:
print(df['names'].value_counts())

john        4
giovanni    4
johannes    2
Name: names, dtype: int64
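
If relative frequencies are more useful than raw counts, `value_counts` accepts a `normalize=True` argument. A minimal sketch, using a short made-up Series rather than the DataFrame above:

```python
import pandas as pd

names = pd.Series(['john', 'giovanni', 'john', 'johannes'], name='names')

# normalize=True returns the share of each value instead of its count
shares = names.value_counts(normalize=True)

print(shares)
```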


## scikit-learn

scikit-learn provides tools for most common data analysis and machine learning tasks, such as regression, classification, and clustering. It also ships with a number of example datasets. In this code, we will load a dataset and fit a simple linear regression (more details on that model later).

In [None]:
from sklearn import datasets as ds

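# Load the Boston Housing dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run
dataset = ds.load_boston()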

# It is a dictionary-like object (a Bunch), see the keys for details:
print(dataset.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [None]:
# The 'DESCR' key holds a description text for the whole dataset
print(dataset['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [None]:
# The data (independent variables) are stored under the 'data' key
# The names of the independent variables are stored in the 'feature_names' key
# Let's use them to create a DataFrame object:
df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
print(df.head())

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  


In [None]:
# The dependent variable is stored separately
df_y = pd.DataFrame(data=dataset['target'], columns=['target'])
print(df_y.head())

   target
0    24.0
1    21.6
2    34.7
3    33.4
4    36.2


In [None]:
# Now, let's build a linear regression model
from sklearn.linear_model import LinearRegression as LR

# First we create a linear regression object
regression = LR()

# Then, we fit the independent and dependent data
regression.fit(df, df_y)

# We can obtain the R^2 score (more on this later)
print(regression.score(df, df_y))

0.7406426641094095


Very often, we need to perform an operation on a single observation. Many methods expect a two-dimensional array in which each row is an observation, so in that case we have to reshape the data using NumPy:

In [None]:
# Consider a single observation 
so = df.loc[2, :]
print(so)

CRIM         0.02729
ZN           0.00000
INDUS        7.07000
CHAS         0.00000
NOX          0.46900
RM           7.18500
AGE         61.10000
DIS          4.96710
RAD          2.00000
TAX        242.00000
PTRATIO     17.80000
B          392.83000
LSTAT        4.03000
Name: 2, dtype: float64


In [None]:
# Just the values of the observation, without metadata
print(so.values)

[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
 6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
 4.0300e+00]


In [None]:
# Reshaping with (1, -1) yields a matrix with a single row; the -1 tells NumPy to infer
# the number of columns from the length of the original observation
# This is often needed to make data compatible with particular methods
print(np.reshape(so.values, (1, -1)))

[[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]]


In [None]:
# For two observations (the values are already two-dimensional, so the shape stays the same):
so_2 = df.loc[2:3, :]
print(np.reshape(so_2.values, (2, -1)))

[[2.7290e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 7.1850e+00
  6.1100e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9283e+02
  4.0300e+00]
 [3.2370e-02 0.0000e+00 2.1800e+00 0.0000e+00 4.5800e-01 6.9980e+00
  4.5800e+01 6.0622e+00 3.0000e+00 2.2200e+02 1.8700e+01 3.9463e+02
  2.9400e+00]]
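
To illustrate why the reshape matters, the sketch below fits a linear regression on a tiny made-up dataset and then predicts for a single observation. The toy data (`X`, `y`, with `y = 2*a + 3*b`) and all variable names are assumptions for this example and are unrelated to the housing data above. scikit-learn's `predict` expects a two-dimensional input, so the one-dimensional observation is reshaped to one row first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data in which y = 2*a + 3*b holds exactly
X = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [0.0, 1.0, 0.0, 1.0]})
y = 2 * X['a'] + 3 * X['b']

model = LinearRegression().fit(X.values, y.values)

# A single observation comes out one-dimensional
single = X.loc[2, :].values
print('Shape before reshape:', single.shape)

# Reshape to one row and as many columns as needed before predicting
single_2d = np.reshape(single, (1, -1))
print('Shape after reshape:', single_2d.shape)

prediction = model.predict(single_2d)
print('Prediction:', prediction)
```

Passing `single` directly to `predict` would raise an error, because scikit-learn cannot tell whether a one-dimensional array is one observation with several features or several observations with one feature.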


This concludes our quick run-through of some basic functionality of these libraries. We will use more, and more specialised, functions and objects as we progress through the course, but this already gives you the basics for playing around with data.