<a href="https://colab.research.google.com/github/DanielMartinAlarcon/DS-Sprint-01-Dealing-With-Data/blob/master/module2-loadingdata/LS_DS_112_Loading_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lambda School Data Science - Loading Data

Data comes in many shapes and sizes - we'll start by loading tabular data, usually in csv format.

Data set sources:

- https://archive.ics.uci.edu/ml/datasets.html
- https://github.com/awesomedata/awesome-public-datasets
- https://registry.opendata.aws/ (beyond scope for now, but good to be aware of)

Let's start with an example - [data about flags](https://archive.ics.uci.edu/ml/datasets/Flags).

## Lecture example - flag data

In [0]:
# Step 1 - find the actual file to download

# From navigating the page, clicking "Data Folder"
flag_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'

# You can "shell out" in a notebook for more powerful tools
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

# Funny extension, but on inspection looks like a csv
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data

# Extensions are just a norm! You have to inspect to be sure what something is

In [0]:
# Step 2 - load the data

# How to deal with a csv? 🐼
import pandas as pd
flag_data = pd.read_csv(flag_data_url)

In [0]:
# Step 3 - verify we've got *something*
flag_data.head()

In [0]:
# Step 4 - Looks a bit odd - verify that it is what we want
flag_data.count()

In [0]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data | wc

In [0]:
# So we have 193 observations with funny names, file has 194 rows
# Looks like the file has no header row, but read_csv assumes it does
help(pd.read_csv)

In [0]:
# Alright, we can pass header=None to fix this
flag_data = pd.read_csv(flag_data_url, header=None)
flag_data.head()
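
To see what `header=None` changes without downloading anything, here is a minimal sketch on a two-line inline CSV. The rows are made up for illustration and are not taken from the flag file.

```python
import io
import pandas as pd

# Two made-up data rows with no header line, standing in for a headerless file
raw = "Afghanistan,5,1\nAlbania,3,1\n"

# Default: the first line is consumed as column names, so one observation is lost
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is data, and columns are numbered 0, 1, 2, ...
with_none = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)        # (1, 3)
print(with_none.shape)           # (2, 3)
print(list(with_none.columns))   # [0, 1, 2]
```

The one-row difference is the same gap seen between `count()` and `wc` above.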

In [0]:
flag_data.count()

In [0]:
flag_data.isna().sum()

### Yes, but what does it *mean*?

This data is fairly nice - it was "donated" and is already "clean" (no missing values). But there are no variable names - so we have to look at the codebook (also from the site).

```
1. name: Name of the country concerned
2. landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
3. zone: Geographic quadrant, based on Greenwich and the Equator; 1=NE, 2=SE, 3=SW, 4=NW
4. area: in thousands of square km
5. population: in round millions
6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
8. bars: Number of vertical bars in the flag
9. stripes: Number of horizontal stripes in the flag
10. colours: Number of different colours in the flag
11. red: 0 if red absent, 1 if red present in the flag
12. green: same for green
13. blue: same for blue
14. gold: same for gold (also yellow)
15. white: same for white
16. black: same for black
17. orange: same for orange (also brown)
18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
19. circles: Number of circles in the flag
20. crosses: Number of (upright) crosses
21. saltires: Number of diagonal crosses
22. quarters: Number of quartered sections
23. sunstars: Number of sun or star symbols
24. crescent: 1 if a crescent moon symbol present, else 0
25. triangle: 1 if any triangles present, 0 otherwise
26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0
27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise
29. topleft: colour in the top-left corner (moving right to decide tie-breaks)
30. botright: Colour in the bottom-right corner (moving left to decide tie-breaks)
```

Exercise - read the help for `read_csv` and figure out how to load the data with the above variable names. One pitfall to note - with `header=None` pandas generated variable names starting from 0, but the above list starts from 1...
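
One possible answer to the exercise, as a sketch: `read_csv` accepts a `names` list, and because names are matched to columns by position, the 0-versus-1 numbering difference stops mattering. The list name `flag_col_names` and the single in-memory row are illustrative assumptions; in the notebook you would pass `flag_data_url` instead of the buffer.

```python
import io
import pandas as pd

# Column names transcribed from the codebook above, in order (30 entries)
flag_col_names = [
    'name', 'landmass', 'zone', 'area', 'population', 'language', 'religion',
    'bars', 'stripes', 'colours', 'red', 'green', 'blue', 'gold', 'white',
    'black', 'orange', 'mainhue', 'circles', 'crosses', 'saltires',
    'quarters', 'sunstars', 'crescent', 'triangle', 'icon', 'animate',
    'text', 'topleft', 'botright',
]

# Illustrative row in the same 30-field shape as the file
sample = io.StringIO(
    "Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,"
    "0,0,0,0,1,0,0,1,0,0,black,green\n"
)

# header=None keeps the first line as data; names supplies the labels
flags_named = pd.read_csv(sample, header=None, names=flag_col_names)
print(flags_named.loc[0, ['name', 'mainhue', 'botright']])
```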

In [1]:
import pandas as pd

chess_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king/krkopt.data'

'''
Attribute Information:
   1. White King file (column)
   2. White King rank (row)
   3. White Rook file
   4. White Rook rank
   5. Black King file
   6. Black King rank
   7. optimal depth-of-win for White in 0 to 16 moves, otherwise drawn
	{draw, zero, one, two, ..., sixteen}.
'''

chess_col_names = ['WKF','WKR','WRF','WRR','BKF','BKR','Moves']

chess_data = pd.read_csv(chess_data_url, header=None, names=chess_col_names)

chess_data.head()

Unnamed: 0,WKF,WKR,WRF,WRR,BKF,BKR,Moves
0,a,1,b,3,c,2,draw
1,a,1,c,1,c,2,draw
2,a,1,c,1,d,1,draw
3,a,1,c,1,d,2,draw
4,a,1,c,2,c,1,draw


In [4]:
chess_data.isnull().sum().sum()

0

New data set, now audiology.
URL: https://archive.ics.uci.edu/ml/datasets/Audiology+%28Standardized%29

Folder: https://archive.ics.uci.edu/ml/machine-learning-databases/audiology/

Data itself: https://archive.ics.uci.edu/ml/machine-learning-databases/audiology/audiology.standardized.data



In [0]:
audio_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/audiology/audiology.standardized.data'
audio_data = pd.read_csv(audio_url, header=None) #Fixed lack of headers

In [31]:
audio_data.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,61,62,63,64,65,66,67,68,69,70
0,f,mild,f,normal,normal,?,t,?,f,f,...,f,f,normal,t,a,f,f,f,p1,cochlear_unknown
1,f,moderate,f,normal,normal,?,t,?,f,f,...,f,f,normal,t,a,f,f,f,p2,cochlear_unknown
2,t,mild,t,?,absent,mild,t,?,f,f,...,f,f,normal,t,as,f,f,f,p3,mixed_cochlear_age_fixation
3,t,mild,t,?,absent,mild,f,?,f,f,...,f,f,normal,t,b,f,f,f,p4,mixed_cochlear_age_otitis_media
4,t,mild,f,normal,normal,mild,t,?,f,f,...,f,f,good,t,a,f,f,f,p5,cochlear_age
5,t,mild,f,normal,normal,mild,t,?,f,f,...,f,f,very_good,t,a,f,f,f,p6,cochlear_age
6,f,mild,f,normal,normal,mild,t,?,f,f,...,f,f,good,t,a,f,f,f,p7,cochlear_unknown
7,f,mild,f,normal,normal,mild,t,?,f,f,...,f,f,very_good,t,a,f,f,f,p8,cochlear_unknown
8,f,severe,f,?,?,?,t,?,f,f,...,f,f,?,t,a,f,f,f,p9,cochlear_unknown
9,t,mild,f,elevated,absent,mild,t,?,f,f,...,f,f,good,t,a,f,f,f,p10,cochlear_age


In [32]:
import numpy as np

# Replace the question marks with actual NaNs
audio_data.replace('?',np.nan, inplace=True)
audio_data.isnull().sum().sum()

291
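
An alternative to replacing after the fact is to tell the parser about the placeholder at load time with the `na_values` parameter of `read_csv`. A minimal sketch on made-up rows that use the same `?` convention:

```python
import io
import pandas as pd

# Made-up rows using '?' for missing values, like the audiology file
raw = "f,mild,?\nt,?,normal\n"

# na_values lists extra strings the parser should read as NaN
df = pd.read_csv(io.StringIO(raw), header=None, na_values='?')

print(df.isnull().sum().sum())  # 2
```

In the notebook this would look like `pd.read_csv(audio_url, header=None, na_values='?')`.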

In [33]:
audio_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,61,62,63,64,65,66,67,68,69,70
0,f,mild,f,normal,normal,,t,,f,f,...,f,f,normal,t,a,f,f,f,p1,cochlear_unknown
1,f,moderate,f,normal,normal,,t,,f,f,...,f,f,normal,t,a,f,f,f,p2,cochlear_unknown
2,t,mild,t,,absent,mild,t,,f,f,...,f,f,normal,t,as,f,f,f,p3,mixed_cochlear_age_fixation
3,t,mild,t,,absent,mild,f,,f,f,...,f,f,normal,t,b,f,f,f,p4,mixed_cochlear_age_otitis_media
4,t,mild,f,normal,normal,mild,t,,f,f,...,f,f,good,t,a,f,f,f,p5,cochlear_age


In [34]:
# Replace the Ts and Fs with real booleans
audio_data.replace('t',True, inplace=True)
audio_data.replace('f',False, inplace=True)
audio_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,61,62,63,64,65,66,67,68,69,70
0,False,mild,False,normal,normal,,True,,False,False,...,False,False,normal,True,a,False,False,False,p1,cochlear_unknown
1,False,moderate,False,normal,normal,,True,,False,False,...,False,False,normal,True,a,False,False,False,p2,cochlear_unknown
2,True,mild,True,,absent,mild,True,,False,False,...,False,False,normal,True,as,False,False,False,p3,mixed_cochlear_age_fixation
3,True,mild,True,,absent,mild,False,,False,False,...,False,False,normal,True,b,False,False,False,p4,mixed_cochlear_age_otitis_media
4,True,mild,False,normal,normal,mild,True,,False,False,...,False,False,good,True,a,False,False,False,p5,cochlear_age


In [45]:
audio_data_filled = audio_data.copy()

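for column in audio_data.columns:
  column_mode = audio_data[column].mode()[0]
  # Assign the result back; fillna(inplace=True) on a column selection
  # may modify a temporary copy in newer pandas versions
  audio_data_filled[column] = audio_data[column].fillna(column_mode)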
  
audio_data_filled.isnull().sum().sum()

0
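
The same mode fill can probably be written without a loop. This is a sketch on a small made-up frame, not the audiology data: `DataFrame.mode()` returns one row per mode, and passing its first row to `fillna` fills each column with its own most frequent value.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one gap per column
df = pd.DataFrame({
    'a': ['x', 'x', np.nan, 'y'],
    'b': ['t', np.nan, 't', 'f'],
})

# Row 0 of mode() holds the first (most frequent) value of every column
modes = df.mode().iloc[0]

# fillna with a Series matches values to columns by label
filled = df.fillna(modes)
print(filled.isnull().sum().sum())  # 0
```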

In [46]:
audio_data_filled.head(15)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,61,62,63,64,65,66,67,68,69,70
0,False,mild,False,normal,normal,mild,True,degraded,False,False,...,False,False,normal,True,a,False,False,False,p1,cochlear_unknown
1,False,moderate,False,normal,normal,mild,True,degraded,False,False,...,False,False,normal,True,a,False,False,False,p2,cochlear_unknown
2,True,mild,True,normal,absent,mild,True,degraded,False,False,...,False,False,normal,True,as,False,False,False,p3,mixed_cochlear_age_fixation
3,True,mild,True,normal,absent,mild,False,degraded,False,False,...,False,False,normal,True,b,False,False,False,p4,mixed_cochlear_age_otitis_media
4,True,mild,False,normal,normal,mild,True,degraded,False,False,...,False,False,good,True,a,False,False,False,p5,cochlear_age
5,True,mild,False,normal,normal,mild,True,degraded,False,False,...,False,False,very_good,True,a,False,False,False,p6,cochlear_age
6,False,mild,False,normal,normal,mild,True,degraded,False,False,...,False,False,good,True,a,False,False,False,p7,cochlear_unknown
7,False,mild,False,normal,normal,mild,True,degraded,False,False,...,False,False,very_good,True,a,False,False,False,p8,cochlear_unknown
8,False,severe,False,normal,normal,mild,True,degraded,False,False,...,False,False,normal,True,a,False,False,False,p9,cochlear_unknown
9,True,mild,False,elevated,absent,mild,True,degraded,False,False,...,False,False,good,True,a,False,False,False,p10,cochlear_age


## Your assignment - pick a dataset and do something like the above

This is purposely open-ended - you can pick any data set you wish. It is highly advised you pick a dataset from UCI or a similar "clean" source.

If you get that done and want to try more challenging or exotic things, go for it! Use documentation as illustrated above, and follow the 20-minute rule (that is - ask for help if you're stuck).

If you have loaded a few traditional datasets, see the following section for suggested stretch goals.

In [0]:
# TODO your work here!
# And note you should write comments, descriptions, and add new
# code and text blocks as needed

'''
Looking around the UCI database, I found this one that seems good.
Medium-sized dataset with missing values.

URL: https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands
Names: https://archive.ics.uci.edu/ml/machine-learning-databases/cylinder-bands/bands.names
Data: https://archive.ics.uci.edu/ml/machine-learning-databases/cylinder-bands/bands.data
'''

In [62]:
# First of all, load the data. By opening it in a browser, I see from the start
# that there's no header, so I note that when reading the CSV. 

# I also make sure that Pandas displays the full dataframe
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

bands_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/cylinder-bands/bands.data'
bands_raw = pd.read_csv(bands_url, header=None)
print(bands_raw.shape)
bands_raw.head(10)

(541, 40)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
0,19910108,X126,TVGUIDE,25503,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,Motter94,821,2.0,TABLOID,NorthUS,1911,55,46,0.2,17.0,78,0.75,20,13.1,1700,50.5,36.4,0.0,0,2.5,1.0,34,40,105.0,100,band
1,19910109,X266,TVGUIDE,25503,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,Motter94,821,2.0,TABLOID,NorthUS,?,55,46,0.3,15.0,80,0.75,20,6.6,1900,54.9,38.5,0.0,0,2.5,0.7,34,40,105.0,100,noband
2,19910104,B7,MODMAT,47201,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,COATED,NO,LINE,YES,WoodHoe70,815,9.0,CATALOG,NorthUS,?,62,40,0.433,16.0,80,?,30,6.5,1850,53.8,39.8,0.0,0,2.8,0.9,40,40,103.87,100,noband
3,19910104,T133,MASSEY,39039,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,816,9.0,CATALOG,NorthUS,1910,52,40,0.3,16.0,75,0.3125,30,5.6,1467,55.6,38.8,0.0,0,2.5,1.3,40,40,108.06,100,noband
4,19910111,J34,KMART,37351,NO,KEY,YES,BENTON,GALLATIN,UNCOATED,COATED,NO,LINE,YES,WoodHoe70,816,2.0,TABLOID,?,1910,50,46,0.3,17.0,80,0.75,30,0.0,2100,57.5,42.5,5.0,0,2.3,0.6,35,40,106.67,100,noband
5,19910104,T218,MASSEY,38039,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,816,2.0,CATALOG,NorthUS,1910,50,40,0.267,16.8,76,0.4375,28,8.6,1467,53.8,37.6,5.0,0,2.5,0.8,40,40,103.87,100,noband
6,19910111,X249,ROSES,35751,NO,KEY,YES,BENTON,GALLATIN,COATED,COATED,NO,LINE,YES,Motter94,827,2.0,TABLOID,CANADIAN,1911,50,46,0.3,16.5,75,0.75,30,0.0,2600,62.5,37.5,6.0,0,2.5,0.6,30,40,106.67,100,noband
7,19910111,X788,ROSES,35751,NO,KEY,YES,BENTON,GALLATIN,COATED,COATED,NO,LINE,YES,Motter94,827,9.0,TABLOID,CANAdiAN,1911,50,46,0.2,16.5,75,0.75,28,0.0,2600,62.5,37.5,6.0,0,2.5,1.1,30,40,106.67,100,noband
8,19910112,M372,MODMAT,47201,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,XYLOL,YES,Albert70,802,7.0,CATALOG,NorthUS,1910,50,45,0.367,12.0,70,0.75,60,0.0,1650,60.2,39.8,1.5,0,3.0,1.0,40,40,103.22,100,band
9,19910114,I320,CHILDCRAFT,37000,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,815,2.0,CATALOG,NorthUS,1911,65,43,0.333,16.0,75,1,32,22.7,1750,45.5,31.8,0.0,0,3.0,1.0,38,40,106.66,100,noband


In [63]:
# Alright, let's check the missing values.
bands_raw.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    1
18    1
19    1
20    1
21    1
22    1
23    2
24    2
25    2
26    2
27    2
28    2
29    2
30    2
31    2
32    2
33    2
34    2
35    2
36    2
37    2
38    2
39    2
dtype: int64

In [64]:
'''
Hmm. This list of null values doesn't actually seem right.  I notice that there's
a '?' in the table above, and searching for that character in the browser view
of the whole dataset shows that it's all over the place, far more often than
the one or two per column suggested here. These missing values must be something
else. But what the hell are they?

I'll now convert every '?' to NaN and check that the mysterious NaNs above
simply get added to the total.  Rather than hoping, I'll verify: I'll count
how many '?'s there are per column and make sure that turning them into NaNs
gives the right total (the '?' count plus the 0, 1, or 2 NaNs listed above).
'''
bands_raw.isin(['?']).sum()

0       0
1       1
2       0
3       1
4      49
5       0
6      57
7      60
8       0
9       0
10      1
11     25
12     56
13     19
14      1
15      1
16      0
17      3
18    156
19     18
20     54
21      5
22     27
23      2
24      1
25     30
26     63
27     55
28     10
29     55
30     55
31     56
32     54
33      6
34      7
35     54
36      7
37      7
38      3
39      0
dtype: int64
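
A guess about the original NaNs: the counts from `isnull().sum()` form a staircase (0 through column 16, 1 for columns 17-22, 2 from column 23 on), which is what you would see if two rows simply had fewer fields than the rest. That is an assumption, not something confirmed against the file. A sketch of how to check it, on a made-up ragged file:

```python
import io
import pandas as pd

# Made-up headerless file: the first rows have 4 fields, the last only 2
raw = "a,1,2,3\nb,?,5,6\nc,7\n"
df = pd.read_csv(io.StringIO(raw), header=None)

# Rows holding any real NaN (a literal '?' does not count as NaN here)
short_rows = df[df.isnull().any(axis=1)]
print(short_rows.index.tolist())                   # [2]

# Number of fields actually present in each of those rows
print(short_rows.notnull().sum(axis=1).tolist())   # [2]
```

In the notebook, `bands_raw[bands_raw.isnull().any(axis=1)]` would show the affected rows directly.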

In [65]:
'''
Alright, then. Those are the counts that should grow by the number of NaNs
already present in each column (0, 1, or 2) once I turn all the question marks
into NaNs and then count the NaNs.
'''

bands = bands_raw.replace('?', np.nan)
bands.isnull().sum()


0       0
1       1
2       0
3       1
4      49
5       0
6      57
7      60
8       0
9       0
10      1
11     25
12     56
13     19
14      1
15      1
16      0
17      4
18    157
19     19
20     55
21      6
22     28
23      4
24      3
25     32
26     65
27     57
28     12
29     57
30     57
31     58
32     56
33      8
34      9
35     56
36      9
37      9
38      5
39      2
dtype: int64

In [67]:
'''
Success!  Looking at the last few columns, they've increased by 2 as expected. 
Whatever the source of those original NaNs, they've all been merged now.
'''
bands.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
0,19910108,X126,TVGUIDE,25503,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,Motter94,821,2.0,TABLOID,NorthUS,1911.0,55,46,0.2,17,78,0.75,20,13.1,1700,50.5,36.4,0,0,2.5,1.0,34,40,105.0,100,band
1,19910109,X266,TVGUIDE,25503,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,Motter94,821,2.0,TABLOID,NorthUS,,55,46,0.3,15,80,0.75,20,6.6,1900,54.9,38.5,0,0,2.5,0.7,34,40,105.0,100,noband
2,19910104,B7,MODMAT,47201,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,COATED,NO,LINE,YES,WoodHoe70,815,9.0,CATALOG,NorthUS,,62,40,0.433,16,80,,30,6.5,1850,53.8,39.8,0,0,2.8,0.9,40,40,103.87,100,noband
3,19910104,T133,MASSEY,39039,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,816,9.0,CATALOG,NorthUS,1910.0,52,40,0.3,16,75,0.3125,30,5.6,1467,55.6,38.8,0,0,2.5,1.3,40,40,108.06,100,noband
4,19910111,J34,KMART,37351,NO,KEY,YES,BENTON,GALLATIN,UNCOATED,COATED,NO,LINE,YES,WoodHoe70,816,2.0,TABLOID,,1910.0,50,46,0.3,17,80,0.75,30,0.0,2100,57.5,42.5,5,0,2.3,0.6,35,40,106.67,100,noband


In [75]:
'''
Now let's look at what those columns are, so as to know how to fill in the 
missing values.  From the website:

6. Number of Attributes: 40 including the class attribute
   -- 20 attributes are numeric, 20 are nominal

7. Attribute Information:
	 1. timestamp: numeric;19500101 - 21001231 
	 2. cylinder number: nominal
	 3. customer: nominal; 
	 4. job number: nominal; 
	 5. grain screened: nominal; yes, no 
	 6. ink color: nominal;  key, type 
	 7. proof on ctd ink:  nominal;  yes, no  
	 8. blade mfg: nominal;  benton, daetwyler, uddeholm 
	 9. cylinder division: nominal;  gallatin, warsaw, mattoon 
	10. paper type: nominal;  uncoated, coated, super 
	11. ink type: nominal;  uncoated, coated, cover 
	12. direct steam: nominal; use; yes, no *
	13. solvent type: nominal;  xylol, lactol, naptha, line, other 
	14. type on cylinder:  nominal;  yes, no  
	15. press type: nominal; use; 70 wood hoe, 70 motter, 70 albert, 94 motter 
	16. press: nominal;  821, 802, 813, 824, 815, 816, 827, 828 
	17. unit number: nominal;  1, 2, 3, 4, 5, 6, 7, 8, 9, 10 
	18. cylinder size: nominal;  catalog, spiegel, tabloid 
	19. paper mill location: nominal; north us, south us, canadian, 
				 scandanavian, mid european
	20. plating tank: nominal; 1910, 1911, other 
	21. proof cut: numeric;  0-100 
	22. viscosity: numeric;  0-100 
	23. caliper: numeric;  0-1.0 
	24. ink temperature: numeric;  5-30 
	25. humifity: numeric;  5-120 
	26. roughness: numeric;  0-2 
	27. blade pressure: numeric;  10-75 
	28. varnish pct: numeric;  0-100 
	29. press speed: numeric;  0-4000 
	30. ink pct: numeric;  0-100 
	31. solvent pct: numeric;  0-100 
	32. ESA Voltage: numeric;  0-16 
	33. ESA Amperage: numeric;  0-10 
	34. wax: numeric ;  0-4.0
	35. hardener:  numeric; 0-3.0 
	36. roller durometer:  numeric;  15-120 
	37. current density:  numeric;  20-50 
	38. anode space ratio:  numeric;  70-130 
	39. chrome content: numeric; 80-120 
	40. band type: nominal; class; band, no band *
'''


# I used RegExr (and like 90 minutes) to parse the text above and extract the 
# column names and a list of the data types (nominal or numeric). Learning
# Regex is totally worth my time, as I've run into this problem many times.

col_names = ['timestamp','cylinder number','customer','job number','grain screened','ink color','proof on ctd ink','blade mfg','cylinder division','paper type','ink type','direct steam','solvent type','type on cylinder','press type','press','unit number','cylinder size','paper mill location','plating tank','proof cut','viscosity','caliper','ink temperature','humifity','roughness','blade pressure','varnish pct','press speed','ink pct','solvent pct','ESA Voltage','ESA Amperage','wax','hardener','roller durometer','current density','anode space ratio','chrome content','band type']
# Data types in column order: 1 numeric, 19 nominal, 19 numeric, 1 nominal
data_types = ['numeric'] + ['nominal'] * 19 + ['numeric'] * 19 + ['nominal']

# First, I properly name all the column headers and verify the change.
bands.columns = col_names
bands.head(20)

Unnamed: 0,timestamp,cylinder number,customer,job number,grain screened,ink color,proof on ctd ink,blade mfg,cylinder division,paper type,ink type,direct steam,solvent type,type on cylinder,press type,press,unit number,cylinder size,paper mill location,plating tank,proof cut,viscosity,caliper,ink temperature,humifity,roughness,blade pressure,varnish pct,press speed,ink pct,solvent pct,ESA Voltage,ESA Amperage,wax,hardener,roller durometer,current density,anode space ratio,chrome content,band type
0,19910108,X126,TVGUIDE,25503,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,Motter94,821,2.0,TABLOID,NorthUS,1911.0,55.0,46,0.2,17.0,78,0.75,20.0,13.1,1700,50.5,36.4,0.0,0,2.5,1.0,34,40,105.0,100,band
1,19910109,X266,TVGUIDE,25503,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,Motter94,821,2.0,TABLOID,NorthUS,,55.0,46,0.3,15.0,80,0.75,20.0,6.6,1900,54.9,38.5,0.0,0,2.5,0.7,34,40,105.0,100,noband
2,19910104,B7,MODMAT,47201,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,COATED,NO,LINE,YES,WoodHoe70,815,9.0,CATALOG,NorthUS,,62.0,40,0.433,16.0,80,,30.0,6.5,1850,53.8,39.8,0.0,0,2.8,0.9,40,40,103.87,100,noband
3,19910104,T133,MASSEY,39039,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,816,9.0,CATALOG,NorthUS,1910.0,52.0,40,0.3,16.0,75,0.3125,30.0,5.6,1467,55.6,38.8,0.0,0,2.5,1.3,40,40,108.06,100,noband
4,19910111,J34,KMART,37351,NO,KEY,YES,BENTON,GALLATIN,UNCOATED,COATED,NO,LINE,YES,WoodHoe70,816,2.0,TABLOID,,1910.0,50.0,46,0.3,17.0,80,0.75,30.0,0.0,2100,57.5,42.5,5.0,0,2.3,0.6,35,40,106.67,100,noband
5,19910104,T218,MASSEY,38039,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,816,2.0,CATALOG,NorthUS,1910.0,50.0,40,0.267,16.8,76,0.4375,28.0,8.6,1467,53.8,37.6,5.0,0,2.5,0.8,40,40,103.87,100,noband
6,19910111,X249,ROSES,35751,NO,KEY,YES,BENTON,GALLATIN,COATED,COATED,NO,LINE,YES,Motter94,827,2.0,TABLOID,CANADIAN,1911.0,50.0,46,0.3,16.5,75,0.75,30.0,0.0,2600,62.5,37.5,6.0,0,2.5,0.6,30,40,106.67,100,noband
7,19910111,X788,ROSES,35751,NO,KEY,YES,BENTON,GALLATIN,COATED,COATED,NO,LINE,YES,Motter94,827,9.0,TABLOID,CANAdiAN,1911.0,50.0,46,0.2,16.5,75,0.75,28.0,0.0,2600,62.5,37.5,6.0,0,2.5,1.1,30,40,106.67,100,noband
8,19910112,M372,MODMAT,47201,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,XYLOL,YES,Albert70,802,7.0,CATALOG,NorthUS,1910.0,50.0,45,0.367,12.0,70,0.75,60.0,0.0,1650,60.2,39.8,1.5,0,3.0,1.0,40,40,103.22,100,band
9,19910114,I320,CHILDCRAFT,37000,YES,KEY,YES,BENTON,GALLATIN,UNCOATED,UNCOATED,NO,LINE,YES,WoodHoe70,815,2.0,CATALOG,NorthUS,1911.0,65.0,43,0.333,16.0,75,1.0,32.0,22.7,1750,45.5,31.8,0.0,0,3.0,1.0,38,40,106.66,100,noband
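
The extraction step can also be done in Python with the `re` module. This is a sketch of one pattern that seems to fit the attribute list, not the expression originally used in RegExr; the variable names and the five sample lines (copied from the list above) are for illustration.

```python
import re

# A few lines copied from the attribute list above
names_text = """
     1. timestamp: numeric;19500101 - 21001231
     2. cylinder number: nominal
     7. proof on ctd ink:  nominal;  yes, no
    34. wax: numeric ;  0-4.0
    35. hardener:  numeric; 0-3.0
"""

# Leading index and dot, the name up to the colon, then the first word after it
pattern = re.compile(r'^\s*\d+\.\s*([^:]+?)\s*:\s*(\w+)', re.MULTILINE)

parsed = pattern.findall(names_text)
parsed_names = [name for name, _ in parsed]
parsed_types = [dtype for _, dtype in parsed]

print(parsed_names)
print(parsed_types)
```

Capturing only `\w+` for the type avoids the stray spaces that a looser pattern would pick up around `nominal` and `numeric`.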


In [82]:
'''My initial plan was to use a loop to remove NaNs according to the data 
type of that column. In nominal columns, I'd replace NaN with the mode for that column.
In numerical columns, I'd use interpolation or something.  Looking at the data
more closely, though, it seems like it has timestamps but all the entries are
actually independent of each other.  It looks like they're all individual sales
or something, so that the rows are uncorrelated.  Therefore, interpolating 
makes no sense.  Instead, I'll just replace with the mode in all cases.
'''

bands_clean = bands.copy()
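for col in bands.columns:
  the_mode = bands_clean[col].mode()[0]
  # Assign the result back; fillna(inplace=True) on a column selection
  # may modify a temporary copy in newer pandas versions
  bands_clean[col] = bands_clean[col].fillna(the_mode)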
  
bands_clean.isna().sum()

timestamp              0
cylinder number        0
customer               0
job number             0
grain screened         0
ink color              0
proof on ctd ink       0
blade mfg              0
cylinder division      0
paper type             0
ink type               0
direct steam           0
solvent type           0
type on cylinder       0
press type             0
press                  0
unit number            0
cylinder size          0
paper mill location    0
plating tank           0
proof cut              0
viscosity              0
caliper                0
ink temperature        0
humifity               0
roughness              0
blade pressure         0
varnish pct            0
press speed            0
ink pct                0
solvent pct            0
ESA Voltage            0
ESA Amperage           0
wax                    0
hardener               0
roller durometer       0
current density        0
anode space ratio      0
chrome content         0
band type              0
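
For comparison, the initial plan (treat nominal and numeric columns differently) might look like the sketch below. Using the median for numeric columns is an assumption chosen for illustration, not what this notebook does, and the tiny frame and the `types` mapping are made up.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; values are strings, as they are after replacing '?'
df = pd.DataFrame({
    'customer': ['KMART', np.nan, 'KMART', 'ROSES'],
    'viscosity': ['46', '40', np.nan, '43'],
})
types = {'customer': 'nominal', 'viscosity': 'numeric'}

clean = df.copy()
for col, kind in types.items():
    if kind == 'numeric':
        # Convert strings to numbers first, then fill gaps with the median
        clean[col] = pd.to_numeric(clean[col])
        clean[col] = clean[col].fillna(clean[col].median())
    else:
        # Most frequent category for nominal columns
        clean[col] = clean[col].fillna(clean[col].mode()[0])

print(clean)
```

With the real data, the mapping could be built as `dict(zip(col_names, (t.strip() for t in data_types)))`.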


In [0]:
# All set!

## Stretch Goals - Other types and sources of data

Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You will probably still want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

Over the course of learning data science you will load data in a variety of ways. Another common way to get data is from a database - most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.
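
As a small taste of querying a database, here is a sketch that uses Python's built-in `sqlite3` module with an in-memory database standing in for a real application database. The `visits` table and its rows are made up.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for an application database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE visits (username TEXT, page TEXT, seconds INTEGER)')
conn.executemany(
    'INSERT INTO visits VALUES (?, ?, ?)',
    [('ana', '/home', 12), ('ben', '/home', 7), ('ana', '/docs', 40)],
)
conn.commit()

# Let the database aggregate, then load the result as a DataFrame
query = (
    'SELECT username, SUM(seconds) AS total_seconds '
    'FROM visits GROUP BY username ORDER BY username'
)
visits = pd.read_sql_query(query, conn)
conn.close()

print(visits)
```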

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.
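
Log files can also be parsed from Python. The sketch below assumes a simple made-up line layout (date, time, level, then `user=` and `action=` fields); real logs vary a lot, so the pattern would need adapting.

```python
import re
import pandas as pd

# Made-up log lines in a simple "date time level key=value" layout
log_text = """\
2018-11-06 09:15:02 INFO user=ana action=login
2018-11-06 09:15:40 ERROR user=ben action=upload
2018-11-06 09:16:11 INFO user=ana action=logout
"""

line_re = re.compile(
    r'^(?P<date>\S+) (?P<time>\S+) (?P<level>\w+) '
    r'user=(?P<user>\w+) action=(?P<action>\w+)$'
)

records = []
for line in log_text.splitlines():
    match = line_re.match(line)
    if match:  # skip lines that don't fit the layout
        records.append(match.groupdict())

logs = pd.DataFrame(records)
print(logs['level'].value_counts())
```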

One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface. The term originally meant, e.g., the way an application interfaced with the GUI or other aspects of an operating system, but now it largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" - you have (usually limited) access.
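
Many web APIs return JSON. The sketch below skips the network call and starts from a hard-coded string standing in for a response body (the payload and its field names are made up); with the `requests` library, the usual equivalent of `payload` would be `requests.get(url).json()`.

```python
import json
import pandas as pd

# Hard-coded text standing in for the body of an HTTP response
response_text = '''
{"results": [
  {"id": 1, "name": "Ana", "location": {"city": "Lima", "country": "PE"}},
  {"id": 2, "name": "Ben", "location": {"city": "Oslo", "country": "NO"}}
]}
'''
payload = json.loads(response_text)

# json_normalize flattens nested objects into dotted column names
people = pd.json_normalize(payload['results'])
print(people.columns.tolist())
```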

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. I suggest image, text, or (public) API - databases are interesting, but few are publicly accessible and they require a great deal of setup.