# Lambda School Data Science - Loading Data

Data comes in many shapes and sizes - we'll start by loading tabular data, usually in CSV format.

Data set sources:

- https://archive.ics.uci.edu/ml/datasets.html
- https://github.com/awesomedata/awesome-public-datasets
- https://registry.opendata.aws/ (beyond scope for now, but good to be aware of)

Let's start with an example - [data about flags](https://archive.ics.uci.edu/ml/datasets/Flags).

## Lecture example - flag data

In [34]:
# Step 1 - find the actual file to download

# From navigating the page, clicking "Data Folder"
flag_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'

# You can "shell out" in a notebook for more powerful tools
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

# Funny extension, but on inspection looks like a csv
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data

# Extensions are just a norm! You have to inspect to be sure what something is

Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
American-Samoa,6,3,0,0,1,1,0,0,5,1,0,1,1,1,0,1,blue,0,0,0,0,0,0,1,1,1,0,blue,red
Andorra,3,1,0,0,6,0,3,0,3,1,0,1,1,0,0,0,gold,0,0,0,0,0,0,0,0,0,0,blue,red
Angola,4,2,1247,7,10,5,0,2,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,1,0,0,red,black
Anguilla,1,4,0,0,1,1,0,1,3,0,0,1,0,1,0,1,white,0,0,0,0,0,0,0,0,1,0,white,blue
Antigua-Barbuda,1,4,0,0,1,1,0,1,5,1,0,1,1,1,1,0,red,0,0,0,0,1,0,1,0,0,0,black,red
Argentina,2,3,2777,28,2,0,0,3,2,0,0,1,0,1,0,0,blue,0,0,0,0,0,0,0,0,0,0,blue,blue
Argentine,2,3,2777,28,2,0,0,3,3,0,0,1,1,1,0,0,blue,0,0,0,0,1,0,0,0,0,0,blue,blue
Australia,6,2,7690,15,1,1,0,0,3,1,0,1,0,1,0,0,blue,0,1,1,1,6,0,0,0,0,0,white,blue
Austria,3,1,84,8,4,0,0,3,2,1,0,0,0,1,0,0,red,0,0,0,0,0,0,0,0,0,0,red,red
Bahamas,1,4,19,0,1,1,0,3,3,0,0,1,1,0,1,0,blue,0,0,

In [0]:
# Step 2 - load the data

# How to deal with a csv? 🐼
import pandas as pd
flag_data = pd.read_csv(flag_data_url)

In [3]:
# Step 3 - verify we've got *something*
flag_data.head()

Unnamed: 0,Afghanistan,5,1,648,16,10,2,0,3,5.1,...,0.5,0.6,1.6,0.7,0.8,1.7,0.9,0.10,black,green.1
0,Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
1,Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
2,American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
3,Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red
4,Angola,4,2,1247,7,10,5,0,2,3,...,0,0,1,0,0,1,0,0,red,black


In [4]:
# Step 4 - Looks a bit odd - verify that it is what we want
flag_data.count()

Afghanistan    193
5              193
1              193
648            193
16             193
10             193
2              193
0              193
3              193
5.1            193
1.1            193
1.2            193
0.1            193
1.3            193
1.4            193
1.5            193
0.2            193
green          193
0.3            193
0.4            193
0.5            193
0.6            193
1.6            193
0.7            193
0.8            193
1.7            193
0.9            193
0.10           193
black          193
green.1        193
dtype: int64

In [5]:
# -s silences curl's progress meter so that only the wc counts (lines, words, bytes) are shown
!curl -s https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data | wc

    194     194   15240


In [6]:
# So we have 193 observations with funny names, file has 194 rows
# Looks like the file has no header row, but read_csv assumes it does
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)


In [7]:
# Alright, we can pass header=None to fix this
flag_data = pd.read_csv(flag_data_url, header=None)
flag_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,Afghanistan,5,1,648,16,10,2,0,3,5,...,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red
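The header pitfall is easy to reproduce offline. The sketch below takes three rows from the curl output above, truncated to their first four fields purely for brevity (an assumption made for illustration), and reads them with and without `header=None`.

```python
import io
import pandas as pd

# Three records from flag.data, cut down to four fields for illustration only
raw = "Afghanistan,5,1,648\nAlbania,3,1,29\nAlgeria,4,1,2388\n"

# Default behaviour: the first line is consumed as the header row
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is treated as data, columns are numbered from 0
without_header = pd.read_csv(io.StringIO(raw), header=None)

raw_line_count = len(raw.splitlines())
print(raw_line_count, len(with_default), len(without_header))
print(list(with_default.columns))
```

Comparing the raw line count with `len(df)` is a quick check that no record was silently turned into column names.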


In [8]:
flag_data.count()

0     194
1     194
2     194
3     194
4     194
5     194
6     194
7     194
8     194
9     194
10    194
11    194
12    194
13    194
14    194
15    194
16    194
17    194
18    194
19    194
20    194
21    194
22    194
23    194
24    194
25    194
26    194
27    194
28    194
29    194
dtype: int64

In [9]:
flag_data.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
dtype: int64

### Yes, but what does it *mean*?

This data is fairly nice - it was "donated" and is already "clean" (no missing values). But there are no variable names - so we have to look at the codebook (also from the site).

```
1. name: Name of the country concerned
2. landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
3. zone: Geographic quadrant, based on Greenwich and the Equator; 1=NE, 2=SE, 3=SW, 4=NW
4. area: in thousands of square km
5. population: in round millions
6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
8. bars: Number of vertical bars in the flag
9. stripes: Number of horizontal stripes in the flag
10. colours: Number of different colours in the flag
11. red: 0 if red absent, 1 if red present in the flag
12. green: same for green
13. blue: same for blue
14. gold: same for gold (also yellow)
15. white: same for white
16. black: same for black
17. orange: same for orange (also brown)
18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
19. circles: Number of circles in the flag
20. crosses: Number of (upright) crosses
21. saltires: Number of diagonal crosses
22. quarters: Number of quartered sections
23. sunstars: Number of sun or star symbols
24. crescent: 1 if a crescent moon symbol present, else 0
25. triangle: 1 if any triangles present, 0 otherwise
26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0
27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise
29. topleft: colour in the top-left corner (moving right to decide tie-breaks)
30. botright: Colour in the bottom-right corner (moving left to decide tie-breaks)
```

Exercise - read the help for `read_csv` and figure out how to load the data with the above variable names. One pitfall to note - with `header=None` pandas generated variable names starting from 0, but the above list starts from 1...
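One possible approach to the exercise, as a sketch rather than the definitive answer: pass the codebook names through the `names` parameter together with `header=None`. The list `flag_columns` is simply the codebook names typed out in order; to keep the example runnable offline it reads two rows copied from the curl output above instead of the URL.

```python
import io
import pandas as pd

# Names taken from the codebook, in order (codebook item 1 becomes column position 0)
flag_columns = [
    'name', 'landmass', 'zone', 'area', 'population', 'language', 'religion',
    'bars', 'stripes', 'colours', 'red', 'green', 'blue', 'gold', 'white',
    'black', 'orange', 'mainhue', 'circles', 'crosses', 'saltires', 'quarters',
    'sunstars', 'crescent', 'triangle', 'icon', 'animate', 'text',
    'topleft', 'botright',
]

# Two records copied from the raw file, standing in for the full download
sample = (
    "Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green\n"
    "Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red\n"
)

flags = pd.read_csv(io.StringIO(sample), header=None, names=flag_columns)
print(flags[['name', 'landmass', 'population', 'mainhue']])
```

In the notebook itself, `flag_data_url` would take the place of `io.StringIO(sample)`. Because `names` is a plain list, the off-by-one between the codebook numbering and pandas' 0-based positions never has to be handled by hand.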

## Lecture Example: Loading King Rook Chess Data


In [10]:
import pandas as pd

chess_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king/krkopt.data'

# The file has seven columns (see the codebook in the next cell), so seven names are needed
chess_col_names = ('white_king_file', 'white_king_rank', 'white_rook_file', 'white_rook_rank', 'black_king_file', 'black_king_rank', 'moves_to_mate')

# Step 1 - read the file (no header row, so pass header=None along with the names)
chess_data = pd.read_csv(chess_data_url, header=None, names=chess_col_names)
chess_data.head(5)

chess_data.isna().sum()


white_king_file    0
white_king_rank    0
white_rook_file    0
white_rook_rank    0
black_king_file    0
black_king_rank    0
moves_to_mate      0
dtype: int64

In [11]:
'''   1. White King file (column)
   2. White King rank (row)
   3. White Rook file
   4. White Rook rank
   5. Black King file
   6. Black King rank
   7. optimal depth-of-win for White in 0 to 16 moves, otherwise drawn
	{draw, zero, one, two, ..., sixteen}.
  '''

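The last attribute is stored as words (`draw`, `zero`, ..., `sixteen`) rather than numbers. A minimal sketch of one way to make it numeric follows; the sample values are hypothetical stand-ins for the `moves_to_mate` column, and mapping `draw` to a missing value is an assumption about how you might want to treat it.

```python
import pandas as pd

# Word labels from the codebook, in order, so the list position equals the depth
depth_words = [
    'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
    'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen',
]
depth_to_int = {word: depth for depth, word in enumerate(depth_words)}

# Hypothetical sample standing in for chess_data['moves_to_mate']
sample = pd.Series(['draw', 'zero', 'sixteen', 'two'])

# Values missing from the dict ('draw') become NaN
numeric_depth = sample.map(depth_to_int)
print(numeric_depth)
```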

## Lecture Example: Audiology Data - standardized import and attempt to clean up



In [12]:
import pandas as pd

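# Widen the display so all rows and columns of interest are shown
# ('display.height' is not a valid option in current pandas and raises an error)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)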

audio_data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/audiology/audiology.standardized.data'

#step one read file
audio_data = pd.read_csv(audio_data_url)
print(audio_data.shape)
audio_data.head(5)

(199, 71)


Unnamed: 0,f,mild,f.1,normal,normal.1,?,t,?.1,f.2,f.3,f.4,f.5,f.6,f.7,f.8,f.9,f.10,f.11,f.12,f.13,f.14,f.15,f.16,f.17,f.18,f.19,f.20,f.21,f.22,f.23,f.24,f.25,f.26,f.27,f.28,f.29,f.30,f.31,f.32,f.33,f.34,f.35,f.36,f.37,f.38,f.39,f.40,f.41,f.42,f.43,f.44,f.45,f.46,f.47,f.48,f.49,f.50,f.51,normal.2,normal.3,f.52,f.53,f.54,normal.4,t.1,a,f.55,f.56,f.57,p1,cochlear_unknown
0,f,moderate,f,normal,normal,?,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,normal,t,a,f,f,f,p2,cochlear_unknown
1,t,mild,t,?,absent,mild,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,absent,f,f,f,normal,t,as,f,f,f,p3,mixed_cochlear_age_fixation
2,t,mild,t,?,absent,mild,f,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,absent,f,f,f,normal,t,b,f,f,f,p4,mixed_cochlear_age_otitis_media
3,t,mild,f,normal,normal,mild,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,good,t,a,f,f,f,p5,cochlear_age
4,t,mild,f,normal,normal,mild,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,very_good,t,a,f,f,f,p6,cochlear_age


In [13]:
# Still missing a row - the first record was consumed as the header, so pass header=None
audio_data = pd.read_csv(audio_data_url, header = None)
print(audio_data.shape)
audio_data.head(5)

(200, 71)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
0,f,mild,f,normal,normal,?,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,normal,t,a,f,f,f,p1,cochlear_unknown
1,f,moderate,f,normal,normal,?,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,normal,t,a,f,f,f,p2,cochlear_unknown
2,t,mild,t,?,absent,mild,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,absent,f,f,f,normal,t,as,f,f,f,p3,mixed_cochlear_age_fixation
3,t,mild,t,?,absent,mild,f,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,?,absent,f,f,f,normal,t,b,f,f,f,p4,mixed_cochlear_age_otitis_media
4,t,mild,f,normal,normal,mild,t,?,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,good,t,a,f,f,f,p5,cochlear_age


In [14]:
# Validate the number of rows: count the lines in the raw file and compare with audio_data.shape
# (piping to `w` instead of `wc` lists logged-in users and makes curl fail with error 23)
!curl -s http://archive.ics.uci.edu/ml/machine-learning-databases/audiology/audiology.standardized.data | wc -l


In [0]:
#validate the columns
 #look at comment headers.


In [15]:
'copy' in dir(audio_data)

True

In [16]:
audio_data.iloc[0]

0                    f
1                 mild
2                    f
3               normal
4               normal
5                    ?
6                    t
7                    ?
8                    f
9                    f
10                   f
11                   f
12                   f
13                   f
14                   f
15                   f
16                   f
17                   f
18                   f
19                   f
20                   f
21                   f
22                   f
23                   f
24                   f
25                   f
26                   f
27                   f
28                   f
29                   f
30                   f
31                   f
32                   f
33                   f
34                   f
35                   f
36                   f
37                   f
38                   f
39                   f
40                   f
41                   f
42                   f
43         

In [17]:
# Missing values are coded as '?' - tell read_csv to treat them as NaN
audio_data = pd.read_csv(audio_data_url, header = None, na_values=['?'])
audio_data.isna().sum().sum() 

291

In [18]:
import numpy as np
# Equivalent fix after loading; a no-op here because na_values already converted every '?'
audio_data.replace('?', np.nan, inplace = True)
audio_data.isna().sum().sum()

291
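Both cells report the same count because they are two routes to the same result. A small offline sketch, using two hypothetical rows in the same `t`/`f`/`?` style as the audiology file:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical rows in the style of the audiology data; '?' marks a missing value
raw = "f,mild,?\nt,?,normal\n"

# Route 1: declare the missing-value marker while reading
at_read = pd.read_csv(io.StringIO(raw), header=None, na_values=['?'])

# Route 2: read everything as-is, then replace the marker afterwards
after_read = pd.read_csv(io.StringIO(raw), header=None).replace('?', np.nan)

print(at_read.isna().sum().sum(), after_read.isna().sum().sum())
```

Handling the marker at read time is usually the tidier choice, since the DataFrame never contains the placeholder at all.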

In [19]:
audio_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
0,f,mild,f,normal,normal,,t,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,normal,t,a,f,f,f,p1,cochlear_unknown
1,f,moderate,f,normal,normal,,t,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,normal,t,a,f,f,f,p2,cochlear_unknown
2,t,mild,t,,absent,mild,t,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,,absent,f,f,f,normal,t,as,f,f,f,p3,mixed_cochlear_age_fixation
3,t,mild,t,,absent,mild,f,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,,absent,f,f,f,normal,t,b,f,f,f,p4,mixed_cochlear_age_otitis_media
4,t,mild,f,normal,normal,mild,t,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,normal,normal,f,f,f,good,t,a,f,f,f,p5,cochlear_age


In [20]:
#clean up the t/f values
audio_data.replace('t', True, inplace=True)
audio_data.replace('f', False, inplace=True)
audio_data.head()
#this is clean enough. 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
0,False,mild,False,normal,normal,,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,normal,True,a,False,False,False,p1,cochlear_unknown
1,False,moderate,False,normal,normal,,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,normal,True,a,False,False,False,p2,cochlear_unknown
2,True,mild,True,,absent,mild,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,absent,False,False,False,normal,True,as,False,False,False,p3,mixed_cochlear_age_fixation
3,True,mild,True,,absent,mild,False,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,absent,False,False,False,normal,True,b,False,False,False,p4,mixed_cochlear_age_otitis_media
4,True,mild,False,normal,normal,mild,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,good,True,a,False,False,False,p5,cochlear_age
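Replacing `'t'` and `'f'` across the whole DataFrame also rewrites any categorical column that happens to use those letters as codes. A more cautious sketch converts only the columns whose values are all `t` or `f`; the helper name `convert_tf_columns` and the sample frame are hypothetical.

```python
import pandas as pd

# Hypothetical frame: 'flag' is a true/false column, 'code' merely contains a 't'
df = pd.DataFrame({
    'flag': ['t', 'f', 't'],
    'level': ['mild', 'moderate', 'mild'],
    'code': ['t', 'a', 'b'],
})

def convert_tf_columns(frame):
    """Return a copy where columns holding only 't'/'f' are mapped to booleans."""
    out = frame.copy()
    for col in out.columns:
        values = set(out[col].dropna().unique())
        # Convert only when every non-missing value is 't' or 'f'
        if values and values <= {'t', 'f'}:
            out[col] = out[col].map({'t': True, 'f': False})
    return out

converted = convert_tf_columns(df)
print(converted)
```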


In [21]:
# How do we actually *clean the data*, i.e. handle the missing values?
audio_data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
count,200,200,200,196,197,125,200,4,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,200,195,198,200,200,200,194,200,200,200,200,200,200,200
unique,2,5,2,3,3,4,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,2,1,2,2,1,2,2,2,2,2,2,2,2,2,1,2,2,2,2,2,1,1,2,1,2,2,2,2,2,2,2,3,3,2,2,2,6,2,5,2,2,2,200,24
top,False,mild,False,normal,normal,mild,False,degraded,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,normal,True,a,False,False,False,p132,cochlear_unknown
freq,124,101,177,117,121,55,150,2,199,180,193,199,198,188,148,198,196,191,194,198,199,200,199,198,193,195,197,199,200,199,199,200,199,199,199,199,192,184,196,191,199,200,197,199,199,198,199,200,200,199,200,199,193,191,197,197,175,189,124,114,198,194,198,83,188,169,199,198,198,1,48


In [22]:
audio_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 71 columns):
0     200 non-null bool
1     200 non-null object
2     200 non-null bool
3     196 non-null object
4     197 non-null object
5     125 non-null object
6     200 non-null bool
7     4 non-null object
8     200 non-null bool
9     200 non-null bool
10    200 non-null bool
11    200 non-null bool
12    200 non-null bool
13    200 non-null bool
14    200 non-null bool
15    200 non-null bool
16    200 non-null bool
17    200 non-null bool
18    200 non-null bool
19    200 non-null bool
20    200 non-null bool
21    200 non-null bool
22    200 non-null bool
23    200 non-null bool
24    200 non-null bool
25    200 non-null bool
26    200 non-null bool
27    200 non-null bool
28    200 non-null bool
29    200 non-null bool
30    200 non-null bool
31    200 non-null bool
32    200 non-null bool
33    200 non-null bool
34    200 non-null bool
35    200 non-null bool
36    200 non-null bool

In [0]:
#help(audio_data.dropna)

In [24]:
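# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
audio_data_forward_filled = audio_data.ffill()
audio_data_forward_filled.head(5)
# Not really appropriate here: a forward fill cannot fill gaps at the top of a column (see columns 5 and 7), so the sample is still incomplete.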


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
0,False,mild,False,normal,normal,,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,normal,True,a,False,False,False,p1,cochlear_unknown
1,False,moderate,False,normal,normal,,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,normal,True,a,False,False,False,p2,cochlear_unknown
2,True,mild,True,normal,absent,mild,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,absent,False,False,False,normal,True,as,False,False,False,p3,mixed_cochlear_age_fixation
3,True,mild,True,normal,absent,mild,False,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,absent,False,False,False,normal,True,b,False,False,False,p4,mixed_cochlear_age_otitis_media
4,True,mild,False,normal,normal,mild,True,,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,normal,normal,False,False,False,good,True,a,False,False,False,p5,cochlear_age
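The gaps that remain in the first rows follow directly from how a forward fill works: it copies the last valid value downward, so a missing value with nothing above it stays missing. A tiny sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical column with a gap at the top and one in the middle
column = pd.Series([None, 'mild', None, 'normal'])

forward_filled = column.ffill()
print(forward_filled)

# The middle gap is filled from the row above; the leading gap has no source value
remaining_gaps = forward_filled.isna().sum()
print(remaining_gaps)
```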


In [25]:
help(audio_data.fillna)

Help on method fillna in module pandas.core.frame:

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs) method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). (values not
        in the dict/Series/DataFrame will not be filled). This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use NEXT valid observation to fill gap
    axis : {0 or 'index', 1 or 'columns'}
    inplace : boolean, default False
        If True, fil

In [0]:
# Let's try filling each column with its mode (most frequent value)

audio_data_filled = audio_data.copy()

for column in audio_data_filled:
    # mode() returns a Series (there can be ties), so take its first entry as the fill value
    column_mode = audio_data_filled[column].mode()[0]
    # print(column_mode)  # test early and often
    # Assign the result back; fillna(inplace=True) on a selected column may not update the DataFrame
    audio_data_filled[column] = audio_data_filled[column].fillna(column_mode)


In [27]:
audio_data_filled.isnull().sum().sum()

0
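The same mode fill can be written without an explicit loop. As a sketch on a small hypothetical frame: `DataFrame.mode()` returns one row per tied mode, so `.iloc[0]` picks the first mode of every column, and `fillna` accepts that Series as a column-to-value lookup.

```python
import pandas as pd

# Hypothetical categorical data with gaps
df = pd.DataFrame({
    'a': ['x', None, 'x', 'y'],
    'b': [None, 'p', 'p', 'q'],
})

# First mode of each column, indexed by column name
column_modes = df.mode().iloc[0]

# fillna matches the Series index against the column names
filled = df.fillna(column_modes)
print(filled)
```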

In [28]:
# For each numeric column, fill missing values with that column's mean
audio_data_copy = audio_data.copy()
df = audio_data_copy

numeric = df.select_dtypes(include=[np.number])

for column in numeric:
    mean = df[column].mean()
    df[column] = df[column].fillna(mean)

# For each nonnumeric column, fill missing values with that column's mode

nonnumeric = df.select_dtypes(exclude=[np.number])

for column in nonnumeric:
    # mode() returns a Series; take its first entry, otherwise fillna aligns on the index
    mode = df[column].mode()[0]
    df[column] = df[column].fillna(mode)

## Your assignment - pick a dataset and do something like the above

This is purposely open-ended - you can pick any data set you wish. It is highly advised you pick a dataset from UCI or a similar "clean" source.

If you get that done and want to try more challenging or exotic things, go for it! Use documentation as illustrated above, and follow the 20-minute rule (that is - ask for help if you're stuck).

If you have loaded a few traditional datasets, see the following section for suggested stretch goals.

In [29]:
import pandas as pd
import numpy as np
mushroom_ds_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'

mushrooms = pd.read_csv(mushroom_ds_url)
mushrooms.head(5)
# Looks like there's no header row - the first record was used as the column names


Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,e,e.1,s.1,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
0,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g


In [30]:
#import with no-header
mushrooms = pd.read_csv(mushroom_ds_url, header = None)
mushrooms.head(5)
# Much better. 23 columns: the class label plus the 22 attributes.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
0,p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g


In [37]:
import numpy as np
import re
temp="""1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
"""
temp = re.sub(r'.+\. ', "", temp) # remove the leading numbering ("1. ", "2. ", ...)
temp = re.sub(r':+.*', ",", temp) # replace everything from the colon onward with a comma

# An earlier attempt (below) built a quoted tuple literal inside the string. That is still a single
# string, and looping over a string goes character by character rather than name by name.
#temp = re.sub('.+\. ',"\'",temp) # remove numbering and spaces
#temp = re.sub(':+.*',"\',", temp) # remove colon, spaces, and indexes
#temp = re.sub("^\'+","(\'",temp) # add '(' to beginning of string
#temp = re.sub("\W+$","\')",temp) # add ')' to the end and remove last comma.


temp = temp.replace('\n', ' ') # get rid of pesky newlines
print(temp)

cap-shape, cap-surface, cap-color, bruises?, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat, 


In [39]:
# The codebook above lists only the 22 attributes; the first column of the file is the class label, so name it too
# Split the single string on commas, strip the stray spaces, and drop the empty entry left by the trailing comma
mushroom_attribute_names = ['class'] + [name.strip() for name in temp.split(',') if name.strip()]
print(mushroom_attribute_names)

# Re-import with no header, using the attribute names as column names
mushrooms = pd.read_csv(mushroom_ds_url, header = None, names = mushroom_attribute_names)
mushrooms.head(5)


['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat']


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises?,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
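The substitute-then-split approach can also be collapsed into a single pattern that captures each name directly. A sketch on three lines of the codebook (shortened here only to keep the example small); naming the first column `class` is an assumption, since the attribute list itself does not name it.

```python
import re

# Three codebook lines copied from the cell above
codebook = """1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
4. bruises?: bruises=t,no=f
"""

# For each line: skip the leading "N. ", capture everything up to the first colon
attribute_names = re.findall(r'^\d+\.\s*([^:]+):', codebook, flags=re.MULTILINE)

# Assumed name for the leading label column
column_names = ['class'] + attribute_names
print(column_names)
```

Capturing the names avoids the leading spaces and the empty trailing entry that splitting on commas produces.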


## Stretch Goals - Other types and sources of data

Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You still will probably want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

While learning data science you will load data in a variety of ways. Another common source is a database - most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.

**Aaron suggests doing the API stretch goal**
One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface. The term originally described, for example, the way an application interfaced with the GUI or other parts of an operating system, but now it largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" - you have (usually limited) access.

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. I suggest image, text, or (public) API - databases are interesting, but not many are publicly accessible and they require a great deal of setup.
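Many public APIs respond with JSON, often a list of records, which maps naturally onto a DataFrame. The sketch below uses a hypothetical response body (the field names `city` and `temp_c` are made up for illustration) so that it runs without network access.

```python
import json
import pandas as pd

# Hypothetical response body: a JSON list of records, one of them with a null value
payload = '[{"city": "A", "temp_c": 21.5}, {"city": "B", "temp_c": null}]'

# Parse the text into Python objects, then build a DataFrame with one row per record
records = json.loads(payload)
weather = pd.DataFrame(records)

print(weather)
print(weather.isna().sum())
```

JSON `null` becomes `NaN`, so the same missing-value checks used on the CSV files apply here as well.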

In [33]:
#https://www.metaweather.com/api/
import pandas as pd
weather_data_url = 'https://www.metaweather.com/api/location/2487956/2013/4/30'
weather_data = pd.read_json(weather_data_url)
print(weather_data)


    air_pressure applicable_date                      created  humidity      id  max_temp  min_temp  predictability  the_temp  visibility weather_state_abbr weather_state_name  wind_direction wind_direction_compass  wind_speed
0         1012.0      2013-04-30  2013-04-30T22:55:17.582290Z      37.0  429009       NaN       NaN              68     27.67   18.191263                  c              Clear           315.0                     NW    9.260890
1            NaN      2013-04-30  2013-04-30T20:55:13.980010Z      61.0  418759     22.28     11.01              70       NaN    9.997862                 lc        Light Cloud           299.0                    WNW   12.350000
2         1012.0      2013-04-30  2013-04-30T18:55:10.671820Z      37.0  415934       NaN       NaN              68     27.67   18.191263                  c              Clear           315.0                     NW    9.260890
3            NaN      2013-04-30  2013-04-30T16:55:05.437890Z      66.0  422086     22.24   