# Lambda School Data Science - Loading Data

Data comes in many shapes and sizes - we'll start by loading tabular data, usually in csv format.

Data set sources:

- https://archive.ics.uci.edu/ml/datasets.html
- https://github.com/awesomedata/awesome-public-datasets
- https://registry.opendata.aws/ (beyond scope for now, but good to be aware of)

Let's start with an example - [data about flags](https://archive.ics.uci.edu/ml/datasets/Flags).

## Lecture example - flag data

In [59]:
# Step 1 - find the actual file to download

# From navigating the page, clicking "Data Folder"
flag_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'

# You can "shell out" in a notebook for more powerful tools
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

# Funny extension, but on inspection looks like a csv
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data

# Extensions are just a norm! You have to inspect to be sure what something is

Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
American-Samoa,6,3,0,0,1,1,0,0,5,1,0,1,1,1,0,1,blue,0,0,0,0,0,0,1,1,1,0,blue,red
Andorra,3,1,0,0,6,0,3,0,3,1,0,1,1,0,0,0,gold,0,0,0,0,0,0,0,0,0,0,blue,red
Angola,4,2,1247,7,10,5,0,2,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,1,0,0,red,black
Anguilla,1,4,0,0,1,1,0,1,3,0,0,1,0,1,0,1,white,0,0,0,0,0,0,0,0,1,0,white,blue
Antigua-Barbuda,1,4,0,0,1,1,0,1,5,1,0,1,1,1,1,0,red,0,0,0,0,1,0,1,0,0,0,black,red
Argentina,2,3,2777,28,2,0,0,3,2,0,0,1,0,1,0,0,blue,0,0,0,0,0,0,0,0,0,0,blue,blue
Argentine,2,3,2777,28,2,0,0,3,3,0,0,1,1,1,0,0,blue,0,0,0,0,1,0,0,0,0,0,blue,blue
Australia,6,2,7690,15,1,1,0,0,3,1,0,1,0,1,0,0,blue,0,1,1,1,6,0,0,0,0,0,white,blue
Austria,3,1,84,8,4,0,0,3,2,1,0,0,0,1,0,0,red,0,0,0,0,0,0,0,0,0,0,red,red
Bahamas,1,4,19,0,1,1,0,3,3,0,0,1,1,0,1,0,blue,0,0,
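
The "inspect to be sure" step can also be done in plain Python. A minimal sketch, assuming a few lines of the file are already available as text (the `sample` string is copied from the output above; `field_counts` is a name made up for this example):

```python
# First three rows of flag.data, copied from the curl output above
sample = """Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
"""

# Count the comma-separated fields on every non-empty line
field_counts = {len(line.split(',')) for line in sample.splitlines() if line}

# A single value in the set means every row has the same number of fields,
# which is what a well-formed csv should look like
print(field_counts)
```

If the set held more than one value, that would be a hint that the delimiter guess is wrong or that some rows are malformed.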

In [0]:
# Step 2 - load the data

# How to deal with a csv? 🐼
import pandas as pd
flag_data = pd.read_csv(flag_data_url, header=None)

In [61]:
# Step 3 - verify we've got *something*
flag_data.shape


(194, 30)

In [0]:
flag_data = flag_data.rename(columns={0:'country', 1:'landmass', 2:'zone'})

In [13]:
flag_data.head()

Unnamed: 0,country,landmass,zone,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,Afghanistan,5,1,648,16,10,2,0,3,5,...,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red


In [7]:
# Step 4 - Looks a bit odd - verify that it is what we want
flag_data.count()

0     194
1     194
2     194
3     194
4     194
5     194
6     194
7     194
8     194
9     194
10    194
11    194
12    194
13    194
14    194
15    194
16    194
17    194
18    194
19    194
20    194
21    194
22    194
23    194
24    194
25    194
26    194
27    194
28    194
29    194
dtype: int64

In [14]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data | wc

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 15240  100 15240    0     0   102k      0 --:--:-- --:--:-- --:--:--  102k
    194     194   15240


In [15]:
# Without header=None, read_csv treats the first row as a header, so we get
# 193 observations with funny column names, while the file has 194 rows
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)


In [0]:
# Alright, we can pass header=None to fix this
flag_data = pd.read_csv(flag_data_url, header=None)
flag_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,Afghanistan,5,1,648,16,10,2,0,3,5,...,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red
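
To see the header pitfall in isolation, here is a small sketch that reads the same three rows twice, once with the default `header='infer'` and once with `header=None`. The rows are copied from the file output earlier; `io.StringIO` stands in for the URL so the example runs offline, and the names `inferred` and `explicit` are just for illustration.

```python
import io
import pandas as pd

# First three rows of flag.data (no header row in the file)
sample = """Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
"""

# Default behaviour: the first row is consumed as column names
inferred = pd.read_csv(io.StringIO(sample))

# header=None: every row is data, columns are numbered from 0
explicit = pd.read_csv(io.StringIO(sample), header=None)

print(inferred.shape)   # one row fewer than the file has lines
print(explicit.shape)   # one row per line in the file
print(list(inferred.columns[:4]))
```

The first row silently turning into column names is why the row count can come out one short of what `wc` reports.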


In [0]:
flag_data.count()

0     194
1     194
2     194
3     194
4     194
5     194
6     194
7     194
8     194
9     194
10    194
11    194
12    194
13    194
14    194
15    194
16    194
17    194
18    194
19    194
20    194
21    194
22    194
23    194
24    194
25    194
26    194
27    194
28    194
29    194
dtype: int64

In [0]:
flag_data.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
dtype: int64

### Yes, but what does it *mean*?

This data is fairly nice - it was "donated" and is already "clean" (no missing values). But there are no variable names - so we have to look at the codebook (also from the site).

```
1. name: Name of the country concerned
2. landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
3. zone: Geographic quadrant, based on Greenwich and the Equator; 1=NE, 2=SE, 3=SW, 4=NW
4. area: in thousands of square km
5. population: in round millions
6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
8. bars: Number of vertical bars in the flag
9. stripes: Number of horizontal stripes in the flag
10. colours: Number of different colours in the flag
11. red: 0 if red absent, 1 if red present in the flag
12. green: same for green
13. blue: same for blue
14. gold: same for gold (also yellow)
15. white: same for white
16. black: same for black
17. orange: same for orange (also brown)
18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
19. circles: Number of circles in the flag
20. crosses: Number of (upright) crosses
21. saltires: Number of diagonal crosses
22. quarters: Number of quartered sections
23. sunstars: Number of sun or star symbols
24. crescent: 1 if a crescent moon symbol present, else 0
25. triangle: 1 if any triangles present, 0 otherwise
26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0
27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise
29. topleft: colour in the top-left corner (moving right to decide tie-breaks)
30. botright: Colour in the bottom-right corner (moving left to decide tie-breaks)
```

Exercise - read the help for `read_csv` and figure out how to load the data with the above variable names. One pitfall to note - with `header=None` pandas generated variable names starting from 0, but the above list starts from 1...
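
One possible approach to the exercise, as a sketch rather than the only answer: pass the codebook names through the `names` parameter of `read_csv`. The `flag_columns` list and the two-row `sample` are assumptions for this example (names typed in from the codebook above, rows copied from the file output); pandas matches names to columns by position, so the 0-versus-1 numbering does not matter as long as the list is in codebook order.

```python
import io
import pandas as pd

# Variable names in codebook order (30 of them)
flag_columns = [
    'name', 'landmass', 'zone', 'area', 'population', 'language', 'religion',
    'bars', 'stripes', 'colours', 'red', 'green', 'blue', 'gold', 'white',
    'black', 'orange', 'mainhue', 'circles', 'crosses', 'saltires',
    'quarters', 'sunstars', 'crescent', 'triangle', 'icon', 'animate',
    'text', 'topleft', 'botright',
]

# Two rows copied from flag.data so the example runs without a download
sample = """Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
"""

# header=None says "no header row in the file"; names supplies our own
flags = pd.read_csv(io.StringIO(sample), header=None, names=flag_columns)

print(flags[['name', 'population', 'mainhue', 'topleft', 'botright']])
```

With the real file, `io.StringIO(sample)` would be replaced by `flag_data_url`.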

In [16]:
from google.colab import files
uploaded = files.upload()

Saving imports-85.data to imports-85.data


1. symboling: -3, -2, -1, 0, 1, 2, 3. 
2. normalized-losses: continuous from 65 to 256. 
3. make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo 

4. fuel-type: diesel, gas. 
5. aspiration: std, turbo. 
6. num-of-doors: four, two. 
7. body-style: hardtop, wagon, sedan, hatchback, convertible. 
8. drive-wheels: 4wd, fwd, rwd. 
9. engine-location: front, rear. 
10. wheel-base: continuous from 86.6 to 120.9. 
11. length: continuous from 141.1 to 208.1. 
12. width: continuous from 60.3 to 72.3. 
13. height: continuous from 47.8 to 59.8. 
14. curb-weight: continuous from 1488 to 4066. 
15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor. 
16. num-of-cylinders: eight, five, four, six, three, twelve, two. 
17. engine-size: continuous from 61 to 326. 
18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi. 
19. bore: continuous from 2.54 to 3.94. 
20. stroke: continuous from 2.07 to 4.17. 
21. compression-ratio: continuous from 7 to 23. 
22. horsepower: continuous from 48 to 288. 
23. peak-rpm: continuous from 4150 to 6600. 
24. city-mpg: continuous from 13 to 49. 
25. highway-mpg: continuous from 16 to 54. 
26. price: continuous from 5118 to 45400.

In [0]:
df = pd.read_csv('imports-85.data', header=None, names=['symboling', 'norm_loss', 
                              'make', 'fuel', 'aspiration', 'doors', 'body_style', 
                              'drive_wheels', 'engine_location', 'wheel_base',
                              'length','width', 'height','curb_weight','engine',
                              'cylinders','engine_size', 'fuel_system','bore',
                              'stroke','compression','hp','peak_rpm','city_mpg',
                              'hgwy_mpg','price'])
# Alternative: pass na_values=['?'] to read_csv so the non-standard
# missing-value marker '?' is read in as NaN right away

In [25]:
df.head()

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
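
A small sketch of what `na_values` changes at load time. The rows in `toy` are made up in the style of `imports-85.data` (only four columns, and the `?` in the price column is invented), so treat the data as an assumption for illustration only.

```python
import io
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy rows in the style of imports-85.data (made up for illustration)
toy = "3,?,alfa-romero,13495\n2,164,audi,13950\n1,158,audi,?\n"
cols = ['symboling', 'norm_loss', 'make', 'price']

# Without na_values, the '?' forces the whole column to be read as text
raw = pd.read_csv(io.StringIO(toy), header=None, names=cols)

# With na_values, '?' becomes NaN and the column can be parsed as numbers
clean = pd.read_csv(io.StringIO(toy), header=None, names=cols, na_values=['?'])

print(is_numeric_dtype(raw['norm_loss']), is_numeric_dtype(clean['norm_loss']))
print(clean.isna().sum())
```

Handling the marker while loading means columns like price come out numeric without a separate conversion step.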


In [0]:
import numpy as np  # numpy provides np.nan, the standard missing-value marker

In [36]:
df_fixna = df.replace('?', np.nan)  # replace '?' missing values with NaN from numpy
df_fixna.head()

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [37]:
df_fixna.dtypes

symboling            int64
norm_loss           object
make                object
fuel                object
aspiration          object
doors               object
body_style          object
drive_wheels        object
engine_location     object
wheel_base         float64
length             float64
width              float64
height             float64
curb_weight          int64
engine              object
cylinders           object
engine_size          int64
fuel_system         object
bore                object
stroke              object
compression        float64
hp                  object
peak_rpm            object
city_mpg             int64
hgwy_mpg             int64
price               object
dtype: object

In [38]:
df_fixna.isnull().sum()

symboling           0
norm_loss          41
make                0
fuel                0
aspiration          0
doors               2
body_style          0
drive_wheels        0
engine_location     0
wheel_base          0
length              0
width               0
height              0
curb_weight         0
engine              0
cylinders           0
engine_size         0
fuel_system         0
bore                4
stroke              4
compression         0
hp                  2
peak_rpm            2
city_mpg            0
hgwy_mpg            0
price               4
dtype: int64

In [0]:
df_fltr_na = df_fixna[~df_fixna.isnull().any(axis=1)]

In [48]:
print(df_fltr_na.isnull().sum())
print('\n', df_fltr_na.shape, '\n')
df_fltr_na.head()

symboling          0
norm_loss          0
make               0
fuel               0
aspiration         0
doors              0
body_style         0
drive_wheels       0
engine_location    0
wheel_base         0
length             0
width              0
height             0
curb_weight        0
engine             0
cylinders          0
engine_size        0
fuel_system        0
bore               0
stroke             0
compression        0
hp                 0
peak_rpm           0
city_mpg           0
hgwy_mpg           0
price              0
dtype: int64

 (159, 26) 



Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
10,2,192,bmw,gas,std,two,sedan,rwd,front,101.2,...,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430
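
The boolean-mask filter used above keeps only rows with no missing values. As far as I know, `DataFrame.dropna()` with its default arguments does the same thing; the sketch below checks that on a made-up three-row frame (`toy`, `by_mask`, and `by_dropna` are names invented here).

```python
import numpy as np
import pandas as pd

# Made-up frame with one complete row and two rows that each miss a value
toy = pd.DataFrame({
    'make': ['audi', 'bmw', 'saab'],
    'norm_loss': [164.0, np.nan, 150.0],
    'doors': ['four', 'two', None],
})

# The approach from the cell above: drop rows where any column is null
by_mask = toy[~toy.isnull().any(axis=1)]

# The built-in shortcut
by_dropna = toy.dropna()

print(by_mask.equals(by_dropna))
print(by_dropna)
```

Either form works; `dropna()` is shorter, while the mask makes the row-selection logic explicit.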


In [52]:
from pandas.api.types import is_numeric_dtype

for header in df_fltr_na:
  if is_numeric_dtype(df_fltr_na[header]):
    print('numeric: ', header)
  else:
    print('not numeric: ', header)

numeric:  symboling
not numeric:  norm_loss
not numeric:  make
not numeric:  fuel
not numeric:  aspiration
not numeric:  doors
not numeric:  body_style
not numeric:  drive_wheels
not numeric:  engine_location
numeric:  wheel_base
numeric:  length
numeric:  width
numeric:  height
numeric:  curb_weight
not numeric:  engine
not numeric:  cylinders
numeric:  engine_size
not numeric:  fuel_system
not numeric:  bore
not numeric:  stroke
numeric:  compression
not numeric:  hp
not numeric:  peak_rpm
numeric:  city_mpg
numeric:  hgwy_mpg
not numeric:  price


In [54]:
df_fltr_na.dtypes

symboling            int64
norm_loss           object
make                object
fuel                object
aspiration          object
doors               object
body_style          object
drive_wheels        object
engine_location     object
wheel_base         float64
length             float64
width              float64
height             float64
curb_weight          int64
engine              object
cylinders           object
engine_size          int64
fuel_system         object
bore                object
stroke              object
compression        float64
hp                  object
peak_rpm            object
city_mpg             int64
hgwy_mpg             int64
price               object
dtype: object
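
Columns such as `hp` and `price` still show up as `object` because they were read as text while the `?` markers were present. One way to convert them afterwards is `pd.to_numeric`; this is a sketch on a tiny frame built from three of the rows shown above, not the full dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Values taken from rows 3, 4 and 10 above, stored as strings on purpose
toy = pd.DataFrame({
    'make': ['audi', 'audi', 'bmw'],
    'hp': ['102', '115', '101'],
    'price': ['13950', '17450', '16430'],
})

# Convert each text column that should hold numbers
for col in ['hp', 'price']:
    toy[col] = pd.to_numeric(toy[col])

print(toy.dtypes)
print(toy['price'].mean())
```

If stray non-numeric strings might remain, `pd.to_numeric(..., errors='coerce')` turns them into NaN instead of raising.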

In [55]:
df_fltr_na['make'].value_counts()

toyota           31
nissan           18
honda            13
subaru           12
mazda            11
volvo            11
mitsubishi       10
dodge             8
volkswagen        8
peugot            7
plymouth          6
saab              6
mercedes-benz     5
bmw               4
audi              4
chevrolet         3
porsche           1
jaguar            1
Name: make, dtype: int64

## Your assignment - pick a dataset and do something like the above

This is purposely open-ended - you can pick any data set you wish. It is highly advised you pick a dataset from UCI or a similar "clean" source.

If you get that done and want to try more challenging or exotic things, go for it! Use documentation as illustrated above, and follow the 20-minute rule (that is - ask for help if you're stuck).

If you have loaded a few traditional datasets, see the following section for suggested stretch goals.

In [70]:
# TODO your work here!
# And note you should write comments, descriptions, and add new
# code and text blocks as needed

#*** I downloaded this, opened it in LibreOffice as a spreadsheet, and used
# that to reformat it as a csv so I could at least get it into pandas as a
# dataframe that I could manipulate. Trying to sort it out from there.

from google.colab import files
upload = files.upload()

Saving audiology.csv to audiology.csv


In [0]:
# Attempting to sort out this dataset...
# Appears to be categorized by conditions? Each condition then has its
# noted symptoms for each patient (p1, p2 etc.). Attempting to sort it out using
# str.contains, cycling through each item, removing the category (bells_palsy,
# acoustic_neuroma) and placing the symptoms into a new data frame by category and
# patient

audiology_data = pd.read_csv('audiology.csv', header=None, names=['patient','acoustic_neuroma',
                             'bells_palsy','cochlear_age','cochlear_age_and_noise','cochlear_age_plus_poss_menieres',
                             'cochlear_noise_and_heredity','cochlear_poss_noise','cochlear_unknown',
                             'conductive_discontinuity','conductive_fixation','mixed_cochlear_age_fixation',
                             'mixed_cochlear_age_otitis_media','mixed_cochlear_age_s_om',
                             'mixed_cochlear_unk_discontinuity','mixed_cochlear_unk_fixation',
                             'mixed_cochlear_unk_ser_om','mixed_poss_central_om','mixed_poss_noise_om',
                             'normal_ear','otitis_media','poss_central','possible_brainstem_disorder',
                             'possible_menieres','retrocochlear_unknown'] )

In [81]:
audiology_data.head()

Unnamed: 0,patient,acoustic_neuroma,bells_palsy,cochlear_age,cochlear_age_and_noise,cochlear_age_plus_poss_menieres,cochlear_noise_and_heredity,cochlear_poss_noise,cochlear_unknown,conductive_discontinuity,...,mixed_cochlear_unk_fixation,mixed_cochlear_unk_ser_om,mixed_poss_central_om,mixed_poss_noise_om,normal_ear,otitis_media,poss_central,possible_brainstem_disorder,possible_menieres,retrocochlear_unknown
0,[p1,cochlear_unknown,[boneAbnormal,air(mild),ar_c(normal),ar_u(normal),o_ar_c(normal),o_ar_u(normal),speech(normal),static(normal),...,,,,,,,,,,
1,[p2,cochlear_unknown,[boneAbnormal,air(moderate),ar_c(normal),ar_u(normal),o_ar_c(normal),o_ar_u(normal),speech(normal),static(normal),...,,,,,,,,,,
2,[p3,mixed_cochlear_age_fixation,[age_gt_60,airBoneGap,boneAbnormal,air(mild),ar_u(absent),bone(mild),o_ar_u(absent),speech(normal),...,,,,,,,,,,
3,[p4,mixed_cochlear_age_otitis_media,[age_gt_60,airBoneGap,air(mild),ar_u(absent),bone(mild),o_ar_u(absent),speech(normal),static(normal),...,,,,,,,,,,
4,[p5,cochlear_age,[age_gt_60,boneAbnormal,air(mild),ar_c(normal),ar_u(normal),bone(mild),o_ar_c(normal),o_ar_u(normal),...,,,,,,,,,,


In [84]:
print(len(audiology_data))

200


In [83]:
for item in audiology_data:
  if item != 'patient':  # column names are strings, so they need quotes
    # .str.contains works on a whole column (Series), not on a single cell
    matches = audiology_data[item].astype(str).str.contains('cochlear')
    print(item, matches.sum())

# I was under the impression that the categories that I set up were how this 
# dataset was organized, and that there was some information for each

**I'm going to do some cleaner datasets to show I actually do get the basics. Probably shouldn't have started with audiology**

In [103]:
census_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

adult = pd.read_csv(census_data_url, header=None, names=['age','work','fnlwgt',
                   'education','education-num','marital-status','occupation',
                   'relatioship','race','sex','capital-gain','capital-loss',
                   'hours-per-week','native-country', 'income'], na_values=[' ?'])
print(adult.count())
print('\n', adult.shape, '\n')
adult.head()

age               32561
work              30725
fnlwgt            32561
education         32561
education-num     32561
marital-status    32561
occupation        30718
relatioship       32561
race              32561
sex               32561
capital-gain      32561
capital-loss      32561
hours-per-week    32561
native-country    31978
income            32561
dtype: int64

 (32561, 15) 



Unnamed: 0,age,work,fnlwgt,education,education-num,marital-status,occupation,relatioship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
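
The `na_values=[' ?']` above needs the leading space because every field in `adult.data` is preceded by one. An alternative sketch, assuming the same comma-plus-space layout, is `skipinitialspace=True`, which strips that space so both the `?` marker and the category labels come out clean. The three rows in `toy` are made up in the style of the file.

```python
import io
import pandas as pd

# Made-up rows in the style of adult.data (note the space after each comma)
toy = "39, State-gov, <=50K\n50, ?, <=50K\n38, Private, >50K\n"
cols = ['age', 'work', 'income']

# skipinitialspace drops the space after the delimiter before parsing,
# so the missing-value marker is plain '?' and labels have no leading space
adult_toy = pd.read_csv(io.StringIO(toy), header=None, names=cols,
                        skipinitialspace=True, na_values=['?'])

print(adult_toy['income'].unique())
print(adult_toy['work'].isna().sum())
```

With the labels stripped, later dictionaries and comparisons can use `'<=50K'` rather than `' <=50K'`.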


In [104]:
print(adult.isna().sum())

age                  0
work              1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relatioship          0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64


In [124]:
print(adult.describe(), '\n\n')  #display statistics about numerical data

# set up for loop to display statistics about columns with missing values to get
# a better idea of what to fill with.

for header in adult:
  if adult[header].isnull().sum() != 0:
    print('\n', header, ':')
    print('\n', adult[header].value_counts(), '\n')
  else:
    print('\n', header, ' has no missing items\n')

                age        fnlwgt  education-num  capital-gain  capital-loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours-per-week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000   



 age  has no missing items


 work :

  Private            

In [131]:
# used forward fill for the first two as they had reasonable distributions 
adult['work'] = adult['work'].ffill()

adult['occupation'] = adult['occupation'].ffill()

# Just filled in United States for native country as there were a lot of 
# countries with only a few people from them. I figured this would mess up 
# any results the least. 
adult['native-country'] = adult['native-country'].fillna('United-States')

print(adult.isna().sum())

age               0
work              0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relatioship       0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64
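
Filling `native-country` with a hard-coded `'United-States'` works here because it is by far the most common value. A slightly more general sketch is to look up the most frequent value with `mode()` and fill with that; the `country` Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries
country = pd.Series(['United-States', 'Cuba', np.nan, 'United-States', np.nan])

# mode() ignores NaN by default and returns the most frequent value(s);
# [0] picks the first one if there is a tie
most_common = country.mode()[0]

filled = country.fillna(most_common)
print(most_common)
print(filled.isna().sum())
```

This keeps the fill value tied to the data, which helps if the same code is reused on a different column or dataset.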


In [139]:
# Some categorical encoding next:
# starting with the income column, 0 is <= $50k; 1 is > $50k

income_dict = {' <=50K':0, ' >50K':1}
adult['income'] = adult['income'].replace(income_dict)  # assign back rather than using inplace on a column
adult.head()

Unnamed: 0,age,work,fnlwgt,education,education-num,marital-status,occupation,relatioship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [140]:
# Marital status one-hot encoding:

pd.get_dummies(adult, columns=["marital-status"], prefix=["relationship"]).head()

Unnamed: 0,age,work,fnlwgt,education,education-num,occupation,relatioship,race,sex,capital-gain,...,hours-per-week,native-country,income,relationship_ Divorced,relationship_ Married-AF-spouse,relationship_ Married-civ-spouse,relationship_ Married-spouse-absent,relationship_ Never-married,relationship_ Separated,relationship_ Widowed
0,39,State-gov,77516,Bachelors,13,Adm-clerical,Not-in-family,White,Male,2174,...,40,United-States,0,0,0,0,0,1,0,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Exec-managerial,Husband,White,Male,0,...,13,United-States,0,0,0,1,0,0,0,0
2,38,Private,215646,HS-grad,9,Handlers-cleaners,Not-in-family,White,Male,0,...,40,United-States,0,1,0,0,0,0,0,0
3,53,Private,234721,11th,7,Handlers-cleaners,Husband,Black,Male,0,...,40,United-States,0,0,0,1,0,0,0,0
4,28,Private,338409,Bachelors,13,Prof-specialty,Wife,Black,Female,0,...,40,Cuba,0,0,0,1,0,0,0,0
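
The dummy column names above contain a space (`relationship_ Divorced`) because the raw labels start with one. A sketch of one way to tidy that up: strip the labels with `.str.strip()` before encoding. The frame and the `marital` prefix are made up for this example.

```python
import pandas as pd

# Made-up labels with the same leading space as in adult.data
toy = pd.DataFrame({
    'marital-status': [' Never-married', ' Divorced', ' Never-married'],
})

# Remove surrounding whitespace so it does not leak into column names
toy['marital-status'] = toy['marital-status'].str.strip()

# One new column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy, columns=['marital-status'], prefix=['marital'])

print(list(dummies.columns))
```

Note that `get_dummies` returns a new frame, so the result has to be assigned to a variable to keep it.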


## Stretch Goals - Other types and sources of data

Not all data comes in a nice single file - for example, image classification involves handling lots of image files. You still will probably want labels for them, so you may have tabular data in addition to the image blobs - and the images may be reduced in resolution and even fit in a regular csv as a bunch of numbers.

If you're interested in natural language processing and analyzing text, that is another example where, while it can be put in a csv, you may end up loading much larger raw data and generating features that can then be thought of in a more standard tabular fashion.

Over the course of learning data science, you will load data in a variety of ways. Another common way to get data is from a database - most modern applications are backed by one or more databases, which you can query to get data to analyze. We'll cover this more in our data engineering unit.
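
To give a flavor of loading from a database without any setup, here is a minimal sketch using Python's built-in `sqlite3` with an in-memory database. The `cars` table and its rows are made up for this example; a real application database would already exist and would need connection details.

```python
import sqlite3
import pandas as pd

# An in-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cars (make TEXT, price INTEGER)')
conn.executemany(
    'INSERT INTO cars VALUES (?, ?)',
    [('audi', 13950), ('audi', 17450), ('bmw', 16430)],
)
conn.commit()

# Run a SQL query and get the result back as a DataFrame
avg_price = pd.read_sql_query(
    'SELECT make, AVG(price) AS avg_price FROM cars GROUP BY make ORDER BY make',
    conn,
)
print(avg_price)

conn.close()
```

The query does the aggregation inside the database, and pandas only receives the small result table.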

How does data get in the database? Most applications generate logs - text files with lots and lots of records of each use of the application. Databases are often populated based on these files, but in some situations you may directly analyze log files. The usual way to do this is with command line (Unix) tools - command lines are intimidating, so don't expect to learn them all at once, but depending on your interests it can be useful to practice.

One last major source of data is APIs: https://github.com/toddmotto/public-apis

API stands for Application Programming Interface. The term originally referred to, e.g., the way an application interfaced with the GUI or other parts of an operating system, but now it largely refers to online services that let you query and retrieve data. You can essentially think of most of them as "somebody else's database" - you have (usually limited) access.
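
Many web APIs return JSON, which is often nested rather than tabular. The sketch below flattens a nested payload with `pd.json_normalize`. The `response_text` string is a hypothetical response body invented for this example (every real API defines its own shape); in practice the text would come from an HTTP request.

```python
import json
import pandas as pd

# Hypothetical response body; not the output of any real API
response_text = '''
{"results": [
  {"name": "Afghanistan", "flag": {"mainhue": "green", "colours": 5}},
  {"name": "Albania", "flag": {"mainhue": "red", "colours": 3}}
]}
'''

# Parse the JSON text into Python dicts and lists
payload = json.loads(response_text)

# Flatten the nested records; nested keys become dotted column names
records = pd.json_normalize(payload['results'])

print(records.columns.tolist())
print(records)
```

Once flattened, the result is an ordinary DataFrame and everything from the earlier sections (counting missing values, checking dtypes, encoding) applies.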

*Stretch goal* - research one of the above extended forms of data/data loading. See if you can get a basic example working in a notebook. Image, text, or (public) APIs are probably more tractable - databases are interesting, but there aren't many publicly accessible and they require a great deal of setup.

In [0]:
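#if I get to this...
def load_a_csv_as(url, df_name, headers_bool, header_names):
  # Placeholder body (a guess at the intent) so the cell runs; df_name is unused
  return pd.read_csv(url, header=0 if headers_bool else None, names=header_names)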