# Workshop 23. NumPy. Pandas. GitHub. Data analysis project.

## Data Analysis Project

1. Find and download a dataset.
2. Check descriptive statistics of the dataset: mean, median, standard deviation for some of the fields.
3. Plot graphs for some of the fields.
4. Describe in 2-3 paragraphs what kind of dataset it is: what data it contains and what patterns you noticed in it.
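
Step 2 could be sketched with pandas roughly as follows. The table here is made-up placeholder data, and the column names `height_cm` and `weight_kg` are hypothetical; in the project the DataFrame would come from the downloaded file.

```python
import pandas as pd

# Hypothetical placeholder data standing in for a downloaded dataset.
df = pd.DataFrame({
    "height_cm": [160.0, 170.0, 180.0, 170.0],
    "weight_kg": [55.0, 65.0, 80.0, 60.0],
})

# Mean, median, and sample standard deviation of the selected columns.
stats = df[["height_cm", "weight_kg"]].agg(["mean", "median", "std"])
print(stats)

# describe() reports count, mean, std, min, quartiles, and max in one call.
print(df.describe())
```

For step 3, `df.plot()` and `df.hist()` draw graphs directly from a DataFrame, provided matplotlib is installed.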

## Data sources

https://dataverse.harvard.edu/

https://datasetsearch.research.google.com/

https://github.com/awesomedata/awesome-public-datasets

Other sources are fine too. Useful search keywords are "open dataset" or "public dataset".

For the purposes of the project, select a dataset from an area where the "data" is a table of meaningful numbers, rather than something complex like fMRI data, images, or eye-tracking data.

## NumPy

The main object of NumPy is an n-dimensional array.

A 1-dimensional array is a vector. It is similar to a Python list, but all of its elements must have the same type (its `dtype`).

A 2-dimensional array is a matrix.
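
The "same type" rule can be seen in a small sketch like the one below (the variable names are only for illustration). When a list mixes integers and floats, NumPy converts every element to a common type.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers: the array gets an integer dtype
mixed = np.array([1, 2.5, 3])    # int mixed with float: every element becomes a float

print(ints.dtype)
print(mixed.dtype)
print(mixed)

matrix = np.array([[1, 2], [3, 4]])   # a list of lists gives a 2-dimensional array
print(ints.ndim, matrix.ndim)
```

The exact integer dtype printed depends on the platform and NumPy version, so it may differ between machines.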

In [1]:
import numpy as np

In [2]:

# np.array is a function that builds an ndarray; here it receives a list of lists (one inner list per row)
a = np.array([[1, 0, 1], [0, 2, 1]])
print(a)

[[1 0 1]
 [0 2 1]]


In [3]:
print(a.ndim) # number of dimensions
print(a.shape) # number of elements along each dimension
print(a.size) # number of elements overall
print(a.T) # transposing a matrix


2
(2, 3)
6
[[1 0]
 [0 2]
 [1 1]]


In [4]:
a_single = np.array([1, 2, 9])
a_double = np.array([[1, 2, 9]])
print(a_single)
print(a_double)
print(a_single.T)
print(a_double.T)
print(a_single.shape)
print(a_double.shape)

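# Storing a vector as a 1-by-n 2-dimensional array makes it behave like a matrix:
# .T turns it into a column, while .T on a 1-dimensional array changes nothing.
# Note that * multiplies element by element (with broadcasting); it is not the matrix product.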

print(a_double * a_double.T)
print(a_single * a_single.T)


[1 2 9]
[[1 2 9]]
[1 2 9]
[[1]
 [2]
 [9]]
(3,)
(1, 3)
[[ 1  2  9]
 [ 2  4 18]
 [ 9 18 81]]
[ 1  4 81]
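
For comparison, here is a short sketch of the same row vector with the matrix-product operator `@` next to the element-wise `*`. The names `row` and `col` are chosen just for this example.

```python
import numpy as np

row = np.array([[1, 2, 9]])   # shape (1, 3)
col = row.T                   # shape (3, 1)

elementwise = row * col       # broadcasting stretches both operands to shape (3, 3)
inner = row @ col             # matrix product (1, 3) x (3, 1): shape (1, 1)
outer = col @ row             # matrix product (3, 1) x (1, 3): shape (3, 3)

print(elementwise)
print(inner)
print(outer)
```

Here `row * col` and `col @ row` happen to give the same 3-by-3 table, while `row @ col` gives the single number 1 + 4 + 81 = 86 wrapped in a 1-by-1 array.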


In [5]:
ab = np.random.rand(3, 3, 4)*10
print(ab)
# A multi-dimensional index is a tuple with one index or slice per axis, separated by commas
ab_slice = ab[1:, :, :3]
print(ab_slice)

[[[0.24534324 4.99414933 2.94046836 4.93240643]
  [0.94263259 2.16358524 4.36667825 0.45077526]
  [0.22696324 7.81718653 3.94527507 8.11897089]]

 [[8.84799394 9.77641234 3.58034904 6.9392753 ]
  [3.7009597  2.64044577 8.37717427 7.36879792]
  [0.86740153 4.08030747 5.5613773  8.31374097]]

 [[8.86578272 4.49653668 9.9486225  2.53231597]
  [6.35341498 4.65478701 8.34111132 4.39914094]
  [8.43595699 1.45752134 2.8406783  4.0339161 ]]]
[[[8.84799394 9.77641234 3.58034904]
  [3.7009597  2.64044577 8.37717427]
  [0.86740153 4.08030747 5.5613773 ]]

 [[8.86578272 4.49653668 9.9486225 ]
  [6.35341498 4.65478701 8.34111132]
  [8.43595699 1.45752134 2.8406783 ]]]
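
Because the array above is random, the same slicing is easier to follow on a predictable array. The sketch below uses `np.arange` instead of `np.random.rand` (an assumption made only so that the values are reproducible) and shows that the comma-separated index is just a tuple of slices.

```python
import numpy as np

cube = np.arange(36).reshape(3, 3, 4)   # values 0..35 arranged in shape (3, 3, 4)

# Skip the first block, keep all rows, keep the first three columns.
part = cube[1:, :, :3]

# The same index written out explicitly as a tuple of slice objects.
same = cube[(slice(1, None), slice(None), slice(None, 3))]

print(part.shape)
print(np.array_equal(part, same))

# A tuple of plain integers picks out a single element.
single = cube[2, 0, 1]
print(single)
```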


NumPy ndarrays support element-wise mathematical operations and, for large arrays, perform them much faster than equivalent loops over Python lists.
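
A rough sketch of what this looks like in practice is shown below. The array size and repeat count in the timing part are arbitrary choices, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np

values = [1, 2, 3, 4, 5]
arr = np.array(values)

doubled_list = [v * 2 for v in values]   # a list needs an explicit loop
doubled_arr = arr * 2                    # an ndarray applies the operation to every element
summed = arr + arr                       # element-wise addition of two arrays

print(doubled_arr, summed)
print(arr.sum(), arr.mean())

# Rough timing comparison on a larger input.
big_list = list(range(100_000))
big_arr = np.arange(100_000)
t_list = timeit.timeit(lambda: [v * 2 for v in big_list], number=10)
t_arr = timeit.timeit(lambda: big_arr * 2, number=10)
print(f"list: {t_list:.4f} s, ndarray: {t_arr:.4f} s")
```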

## Pandas

### DataFrame

A DataFrame is a two-dimensional table of values in which rows are labeled by an index and columns are labeled by names in the header. Generally, each row is an object and each column is one property of that object.
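
As a small illustration of "rows are objects, columns are properties", a DataFrame can also be built from a dictionary of columns. The names and ages below are invented for the example.

```python
import pandas as pd

# Each key becomes a column; each position across the lists becomes a row.
people = pd.DataFrame(
    {"name": ["Ann", "Bob", "Eve"], "age": [31, 25, 40]},
    index=["a", "b", "c"],   # row labels (the index)
)

print(people)
print(people.shape)      # (number of rows, number of columns)
print(people["age"])     # one column, returned as a Series
print(people.loc["b"])   # one row, selected by its index label
```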



In [2]:
import pandas as pd

In [7]:
df_ex_no_labels = pd.DataFrame(ab[1,:,:])
print(df_ex_no_labels)

          0         1         2         3
0  8.847994  9.776412  3.580349  6.939275
1  3.700960  2.640446  8.377174  7.368798
2  0.867402  4.080307  5.561377  8.313741


In [8]:
df_ex = pd.DataFrame(ab[2,:,:], columns=['Value 1', 'Important number', 'Value 2', 'Another value'])
print(df_ex)

    Value 1  Important number   Value 2  Another value
0  8.865783          4.496537  9.948622       2.532316
1  6.353415          4.654787  8.341111       4.399141
2  8.435957          1.457521  2.840678       4.033916


In [9]:
# Usually a data frame is read from a file

data = '''Example 1,Example 2,Example3
1,2,3
2,51,35
3,100,40
4,50,25
'''

with open('df.csv', 'w') as df_file:
    df_file.write(data)

df = pd.read_csv("df.csv")
print(df)

   Example 1  Example 2  Example3
0          1          2         3
1          2         51        35
2          3        100        40
3          4         50        25
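
If the CSV text is already in memory, writing a temporary file is not strictly necessary. One possible alternative, sketched here with the same example data, is to wrap the string in `io.StringIO`, which `pd.read_csv` accepts in place of a file name.

```python
import io

import pandas as pd

data = '''Example 1,Example 2,Example3
1,2,3
2,51,35
3,100,40
4,50,25
'''

# StringIO makes the string behave like an open text file.
df_mem = pd.read_csv(io.StringIO(data))

print(df_mem)
print(df_mem.dtypes)               # every column was parsed as integers
print(df_mem["Example 2"].mean())  # column names are taken from the first line
```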


In [4]:
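# This URL points to a Kaggle web page (HTML), not to a raw CSV file, so read_csv cannot parse it.
# Download results.csv from that page first and pass the path of the local file instead.
df2 = pd.read_csv('https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017?select=results.csv')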
print(df2)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2


In [11]:
df2.columns

Index(['Provider', 'Station_ID', 'Composed_Indicator', 'Overall_Availability',
       'Longest_Availability_without_Gaps', 'Continuity_in_Available_Data',
       'Minimum_Relative_Availability_for_a_Month', 'Ratio_of_Outliers',
       'Homogeneity_of_Yearly_Averages', 'Trend_in_Annual_Flows',
       'Trend_in_One_Month'],
      dtype='object')

In [12]:
df2[['Provider', 'Trend_in_One_Month']]

   Provider  Trend_in_One_Month
0       ANA         Significant
1       ANA                   -
2       BOM     Not_significant
3       BOM         Significant
4       BOM     Not_significant
5       BOM     Not_significant
6       BOM     Not_significant
7      CHDP         Significant
8      CHDP         Significant
9      CHDP         Significant


In [14]:
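# Count the non-missing 'Trend_in_One_Month' values for each provider.
# The keyword is `aggfunc` (not `aggrfunc`), and NumPy has no `count` function
# (the original call failed with "AttributeError: module 'numpy' has no attribute 'count'"),
# so the aggregation is passed by name.
df2.pivot_table(index='Provider', values='Trend_in_One_Month', aggfunc='count')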
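
A working count per provider might look like the sketch below. It rebuilds only the ten rows displayed above (the full table may contain more), so the numbers describe that sample only. The names `sample`, `counts` and `by_group` are chosen for this example.

```python
import pandas as pd

# Only the ten rows displayed above; the full table may contain more.
sample = pd.DataFrame({
    "Provider": ["ANA", "ANA", "BOM", "BOM", "BOM", "BOM", "BOM", "CHDP", "CHDP", "CHDP"],
    "Trend_in_One_Month": [
        "Significant", "-", "Not_significant", "Significant", "Not_significant",
        "Not_significant", "Not_significant", "Significant", "Significant", "Significant",
    ],
})

# pivot_table takes the aggregation through the `aggfunc` keyword.
counts = sample.pivot_table(index="Provider", values="Trend_in_One_Month", aggfunc="count")
print(counts)

# The same counts with groupby, which returns a Series instead of a DataFrame.
by_group = sample.groupby("Provider")["Trend_in_One_Month"].count()
print(by_group)
```

Note that `count` treats the `-` entry as an ordinary value, because it is a string and not a missing value.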