This notebook covers some basic procedures available in NumPy, pandas, and scikit-learn. It only scratches the surface of what these libraries offer, and many other functions will be used later on. Make sure to explore the possibilities yourself and get a grasp of how the modules are imported and used. Let's start by importing them:

In [2]:
# The PyPI package is called scikit-learn; it is imported as sklearn
!pip install scikit-learn

import numpy as np
import pandas as pd
import sklearn




# NumPy

NumPy can be used for many different things, most notably for manipulating arrays and performing mathematical operations:

In [2]:
# Create zero-filled arrays/matrices
empty_array = np.zeros(5)

empty_matrix = np.zeros((5,2))

print('Empty array: \n',empty_array)
print('Empty matrix: \n',empty_matrix)

Empty array: 
 [0. 0. 0. 0. 0.]
Empty matrix: 
 [[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
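
Note that `np.zeros` returns arrays filled with zeros rather than uninitialised ones. The following is a minimal sketch of how the shape and dtype can be controlled; the variable names are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np

# np.zeros takes a shape (an int or a tuple) and an optional dtype
zeros_float = np.zeros((5, 2))           # default dtype is float64
zeros_int = np.zeros((5, 2), dtype=int)  # integer zeros instead

# np.ones and np.full follow the same pattern
ones = np.ones(3)
sevens = np.full((2, 2), 7)

print(zeros_float.shape, zeros_float.dtype)
print(zeros_int.dtype)
print(ones)
print(sevens)
```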


In [4]:
# Create matrices
mat = np.array([[1,2,3],[4,5,6]])
print('Matrix: \n', mat)
print('Transpose: \n', mat.T)
print('Item 2,2: ', mat[1,1])
print('Item 2,3: ', mat[1,2])
print('rows and columns: ', np.shape(mat))
print('Sum total matrix: ', np.sum(mat))
print('Sum row 1: ' , np.sum(mat[0]))
print('Sum row 2: ', np.sum(mat[1]))
print('Sum column 3: ', np.sum(mat,axis=0)[2])

Matrix: 
 [[1 2 3]
 [4 5 6]]
Transpose: 
 [[1 4]
 [2 5]
 [3 6]]
Item 2,2:  5
Item 2,3:  6
rows and columns:  (2, 3)
Sum total matrix:  21
Sum row 1:  6
Sum row 2:  15
Sum column 3:  9
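
The sums above can also be computed for all rows or columns at once with the `axis` argument, and the usual arithmetic operators work element-wise on arrays. A short sketch using the same matrix (the names `col_sums`, `row_sums`, `doubled`, and `product` are illustrative):

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one sum per column
col_sums = np.sum(mat, axis=0)
# axis=1 collapses the columns, giving one sum per row
row_sums = np.sum(mat, axis=1)

# Arithmetic operators work element-wise
doubled = mat * 2

# The @ operator performs matrix multiplication: (2x3) @ (3x2) -> (2x2)
product = mat @ mat.T

print('Column sums:', col_sums)  # [5 7 9]
print('Row sums:', row_sums)     # [ 6 15]
print('Doubled:\n', doubled)
print('Product:\n', product)     # [[14 32] [32 77]]
```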


# pandas

### Creating dataframes

pandas is great for reading and creating datasets, as well as performing basic operations on them.

In [41]:
# Creating a list of lists with three rows of data
data = [['johannes',10], ['giovanni',2], ['john',3]]

# Creating and printing a pandas DataFrame object from the list
df = pd.DataFrame(data)
print(df)

          0   1
0  johannes  10
1  giovanni   2
2      john   3


In [42]:
# Naming the columns of the DataFrame object
df.columns = ['names', 'years']
print(df)

      names  years
0  johannes     10
1  giovanni      2
2      john      3


Alternatively, pass the column names when creating the DataFrame:

In [43]:
df_2 = pd.DataFrame(data = data, columns = ['names', 'years'])
print(df_2)

      names  years
0  johannes     10
1  giovanni      2
2      john      3
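
A DataFrame can also be built from a dictionary, where each key becomes a column name. This is a small sketch with the same data; `df_3` is an illustrative name.

```python
import pandas as pd

# Each key becomes a column name, each list holds that column's values
df_3 = pd.DataFrame({
    'names': ['johannes', 'giovanni', 'john'],
    'years': [10, 2, 3],
})

print(df_3)
print(df_3.shape)  # (3, 2): three rows, two columns
```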


In [44]:
# Selecting a single column and calculating its sum
# Printing the column also shows its dtype: 64-bit integers (int64)
print(df['years'])
print('Sum of all values in column: ', df['years'].sum())

0    10
1     2
2     3
Name: years, dtype: int64
Sum of all values in column:  15
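
A selected column is a pandas `Series`, which offers more summary methods than `sum()`. A brief sketch, rebuilding the same three-row DataFrame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame([['johannes', 10], ['giovanni', 2], ['john', 3]],
                  columns=['names', 'years'])

years = df['years']  # a single column is a pandas Series

print(type(years).__name__)  # Series
print(years.mean())          # 5.0
print(years.max())           # 10
print(years.describe())      # count, mean, std, min, quartiles, max
```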


In [45]:
# Creating a larger list of rows
data = [['johannes',10], ['giovanni',2], ['john',3], ['giovanni',2], ['john',3], ['giovanni',2], ['john',3], ['giovanni',2], ['john',3], ['johannes',10]]

# Again, creating a DataFrame object, now with columns
df = pd.DataFrame(data, columns = ['names','years'])

# Print the first 5 (head) and last 5 (tail) observations
print(df.head())
print('\n')
print(df.tail())

      names  years
0  johannes     10
1  giovanni      2
2      john      3
3  giovanni      2
4      john      3


      names  years
5  giovanni      2
6      john      3
7  giovanni      2
8      john      3
9  johannes     10


### Reading files

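pandas can read data from files, for example CSV files with `pd.read_csv`. By default, the first line of the file is used as the header, even when it actually contains data: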

In [55]:
dataset = pd.read_csv('amounts.csv')
print(dataset.head())

    Odilia Ellerbee  F  1962
0    Rosia    Babst  F  1721
1  Zackary    Ellis  M  1589
2      Leo  Brescia  M  1248
3   Santos  Falcone  M  1743
4       Ta     Burr  F  1704


In [56]:
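# The file has four columns, so four names are needed; with only three names,
# pandas would use the first column as the index and shift the labels.
# The name 'gender' for the F/M column is an assumption.
dataset = pd.read_csv('amounts.csv', names = ['first_name', 'last_name', 'gender', 'amount'])
print(dataset.head())

  first_name last_name gender  amount
0     Odilia  Ellerbee      F    1962
1      Rosia     Babst      F    1721
2    Zackary     Ellis      M    1589
3        Leo   Brescia      M    1248
4     Santos   Falcone      M    1743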
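
To try this without the file at hand, the same call can be run on an in-memory string. This is a sketch under assumptions: the three rows are copied from the output above, the file is taken to be comma-separated, and the column name `gender` is a guess at what the F/M column means.

```python
import io
import pandas as pd

# Stand-in for a header-less CSV file (rows copied from the output above)
csv_text = (
    "Odilia,Ellerbee,F,1962\n"
    "Rosia,Babst,F,1721\n"
    "Zackary,Ellis,M,1589\n"
)

# header=None states explicitly that the first line is data, not a header;
# names supplies one label per column ('gender' is an assumed label)
sample = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=['first_name', 'last_name', 'gender', 'amount'],
)

print(sample)
print(sample.dtypes)
```

Passing one name per column keeps the default integer index; passing fewer names than there are columns makes pandas use the leading columns as the index instead.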


### Using dataframes

In [60]:
# Print all unique values of the column 'names'
print(df['names'].unique())

['johannes' 'giovanni' 'john']


In [61]:
# Print all values and their frequency:
print(df['names'].value_counts())
print(df['years'].value_counts())

john        4
giovanni    4
johannes    2
Name: names, dtype: int64
3     4
2     4
10    2
Name: years, dtype: int64
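
Relative frequencies are often more useful than raw counts. A minimal sketch using the `normalize` argument of `value_counts`, with the same ten names as above (`shares` is an illustrative name):

```python
import pandas as pd

names = pd.Series(['johannes', 'giovanni', 'john', 'giovanni', 'john',
                   'giovanni', 'john', 'giovanni', 'john', 'johannes'])

# normalize=True returns proportions that sum to 1 instead of counts
shares = names.value_counts(normalize=True)

print(shares)
print(shares['john'])  # 0.4
```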


In [62]:
# Add a column named 'code' with all zeros (one per row)
df['code'] = np.zeros(len(df))
print(df)

      names  years  code
0  johannes     10   0.0
1  giovanni      2   0.0
2      john      3   0.0
3  giovanni      2   0.0
4      john      3   0.0
5  giovanni      2   0.0
6      john      3   0.0
7  giovanni      2   0.0
8      john      3   0.0
9  johannes     10   0.0
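
A new column does not have to come from an array of matching length. As a sketch, a single value is broadcast to every row, and a column can be computed from existing ones; the column name `years_doubled` is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({'names': ['johannes', 'giovanni', 'john'],
                   'years': [10, 2, 3]})

# A scalar is broadcast to every row (integer zeros here, not floats)
df['code'] = 0

# A column computed element-wise from an existing column
df['years_doubled'] = df['years'] * 2

print(df)
```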


You can also select rows and columns of a DataFrame by label using ```.loc```:

In [63]:
# Rows labelled 2 through 5 (both included) and all columns:
print(df.loc[2:5, :])

      names  years  code
2      john      3   0.0
3  giovanni      2   0.0
4      john      3   0.0
5  giovanni      2   0.0
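
Unlike regular Python slicing, a `.loc` slice includes its end label, which is why four rows are printed above. The sketch below contrasts this with position-based `.iloc` and shows a boolean condition inside `.loc`; it uses a shortened copy of the data so that it runs on its own.

```python
import pandas as pd

data = [['johannes', 10], ['giovanni', 2], ['john', 3],
        ['giovanni', 2], ['john', 3], ['giovanni', 2]]
df = pd.DataFrame(data, columns=['names', 'years'])

# .loc selects by label and includes both endpoints: rows 2, 3, 4 and 5
by_label = df.loc[2:5, :]

# .iloc selects by integer position and excludes the end: rows 2, 3 and 4
by_position = df.iloc[2:5, :]

# A boolean condition can be used in .loc to filter rows
older = df.loc[df['years'] > 2, ['names', 'years']]

print(len(by_label))     # 4
print(len(by_position))  # 3
print(older)
```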


In [64]:
# Looping over the columns
for variable in df.columns:
    print(df[variable])

0    johannes
1    giovanni
2        john
3    giovanni
4        john
5    giovanni
6        john
7    giovanni
8        john
9    johannes
Name: names, dtype: object
0    10
1     2
2     3
3     2
4     3
5     2
6     3
7     2
8     3
9    10
Name: years, dtype: int64
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    0.0
Name: code, dtype: float64


In [65]:
# Looping over the columns and obtaining the values (which returns a NumPy array)
for variable in df.columns:
    print(df[variable].values)

['johannes' 'giovanni' 'john' 'giovanni' 'john' 'giovanni' 'john'
 'giovanni' 'john' 'johannes']
[10  2  3  2  3  2  3  2  3 10]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
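
As a possible alternative to indexing by column name inside the loop, `DataFrame.items()` yields each column name together with its Series, and `to_numpy()` returns the same kind of array as `.values`. A small sketch on a three-row DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['johannes', 'giovanni', 'john'],
                   'years': [10, 2, 3]})

# items() yields (column name, Series) pairs
for name, column in df.items():
    # to_numpy() converts the Series to a NumPy array
    array = column.to_numpy()
    print(name, type(array).__name__, array)
```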
