# 2.3.7 Loading Data

## Reading in a Data Set (using `pandas`)

The datasets used here, `Auto.csv` and `Auto.data`, can be found [here](https://www.statlearning.com/resources-python).

In [6]:
DATASET_PATH = '../datasets/'

In [14]:
import pandas as pd
csv_path = DATASET_PATH + 'Auto.csv'
Auto = pd.read_csv(csv_path)
Auto

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
392,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
393,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
394,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
395,28.0,4,120.0,79,2625,18.6,82,1,ford ranger


In [8]:
data_path = DATASET_PATH + 'Auto.data'
# sep=r'\s+' splits on runs of whitespace (delim_whitespace=True is deprecated in recent pandas)
Auto_Data = pd.read_csv(data_path, sep=r'\s+')
Auto_Data

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
392,27.0,4,140.0,86.00,2790.0,15.6,82,1,ford mustang gl
393,44.0,4,97.0,52.00,2130.0,24.6,82,2,vw pickup
394,32.0,4,135.0,84.00,2295.0,11.6,82,1,dodge rampage
395,28.0,4,120.0,79.00,2625.0,18.6,82,1,ford ranger


When the `DataWrangler` extension is used in VS Code, evaluating the bare expression `Auto['horsepower']` does not show additional information such as `Length` or `dtype`.

In [11]:
Auto['horsepower']

0      130
1      165
2      150
3      150
4      140
      ... 
392     86
393     52
394     84
395     79
396     82
Name: horsepower, Length: 397, dtype: object

Therefore, use a regular `print` statement:

In [10]:
print(Auto['horsepower'])

0      130
1      165
2      150
3      150
4      140
      ... 
392     86
393     52
394     84
395     79
396     82
Name: horsepower, Length: 397, dtype: object


In [12]:
import numpy as np

np.unique(Auto['horsepower'])

array(['100', '102', '103', '105', '107', '108', '110', '112', '113',
       '115', '116', '120', '122', '125', '129', '130', '132', '133',
       '135', '137', '138', '139', '140', '142', '145', '148', '149',
       '150', '152', '153', '155', '158', '160', '165', '167', '170',
       '175', '180', '190', '193', '198', '200', '208', '210', '215',
       '220', '225', '230', '46', '48', '49', '52', '53', '54', '58',
       '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70',
       '71', '72', '74', '75', '76', '77', '78', '79', '80', '81', '82',
       '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93',
       '94', '95', '96', '97', '98', '?'], dtype=object)
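
The sorted output above puts `'100'` before `'46'` because the values are strings, and the trailing `'?'` is what keeps the column from being numeric. A minimal sketch of how such placeholders could be located without scanning the full list by eye is given here; the toy Series `hp` is made up for illustration and is not read from the `Auto` files.

```python
import pandas as pd

# Toy column (illustrative values): numbers stored as strings plus a '?' marker.
hp = pd.Series(['130', '165', '?', '150', '?'], name='horsepower')

# errors='coerce' turns every entry that cannot be parsed as a number into NaN.
hp_numeric = pd.to_numeric(hp, errors='coerce')

# The raw values at the positions that failed to parse are the placeholders.
bad_entries = hp[hp_numeric.isna()].unique()

print(bad_entries)
print(hp_numeric)
```

Once the placeholder is known, it can be passed to `pd.read_csv` through `na_values`, so that the column is parsed as numeric from the start.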

In [16]:
Auto = pd.read_csv(data_path,
                   na_values=['?'],   # treat '?' as a missing value
                   sep=r'\s+')        # whitespace-separated (replaces deprecated delim_whitespace=True)

print(Auto['horsepower'])

0      130.0
1      165.0
2      150.0
3      150.0
4      140.0
       ...  
392     86.0
393     52.0
394     84.0
395     79.0
396     82.0
Name: horsepower, Length: 397, dtype: float64
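
The effect of `na_values` on the column type can be reproduced without the data files. The following sketch reads a small in-memory text in the same whitespace-separated layout; the three rows in `raw` are a made-up excerpt and only stand in for `Auto.data`.

```python
import io
import pandas as pd

# Hypothetical excerpt: whitespace-separated, quoted names, '?' for missing values.
raw = '''mpg horsepower name
18.0 130.0 "chevrolet chevelle malibu"
25.0 ? "ford pinto"
21.0 ? "ford maverick"
'''

# Without na_values, the '?' entries force the column to be read as text.
as_text = pd.read_csv(io.StringIO(raw), sep=r'\s+')

# With na_values, '?' becomes NaN and the column is parsed as floating point.
as_number = pd.read_csv(io.StringIO(raw), sep=r'\s+', na_values=['?'])

print(as_text['horsepower'].dtype)
print(as_number['horsepower'].dtype)
print(as_number['horsepower'].isna().sum())
```

Because `sum()` skips `NaN` by default, aggregations on the re-read column work even before the incomplete rows are dropped.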


In [17]:
Auto['horsepower'].sum()

40952.0

In [18]:
Auto.shape

(397, 9)

In [26]:
Auto.isna().any()

mpg             False
cylinders       False
displacement    False
horsepower       True
weight          False
acceleration    False
year            False
origin          False
name            False
dtype: bool

In [27]:
Auto_new = Auto.dropna()
Auto_new.shape

(392, 9)

In [28]:
Auto_new.isna().any()

mpg             False
cylinders       False
displacement    False
horsepower      False
weight          False
acceleration    False
year            False
origin          False
name            False
dtype: bool
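
Before dropping rows it can be useful to see how many values are missing and which rows are affected. The sketch below does this on a small made-up frame (the values and the single-letter names are placeholders); it also shows that `dropna()` keeps the original index labels, which is why an index can end at a label larger than the number of remaining rows.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing horsepower values (illustrative data only).
df = pd.DataFrame({
    'mpg': [18.0, 25.0, 21.0, 30.0],
    'horsepower': [130.0, np.nan, np.nan, 70.0],
    'name': ['a', 'b', 'c', 'd'],
})

n_missing = df.isna().sum()                 # number of missing values per column
rows_with_na = df[df.isna().any(axis=1)]    # the rows that dropna() would remove
cleaned = df.dropna(subset=['horsepower'])  # only consider this column when dropping

print(n_missing)
print(rows_with_na)
print(cleaned.index.tolist())               # original labels are kept, not renumbered
```

If consecutive labels are wanted afterwards, `cleaned.reset_index(drop=True)` renumbers the rows from zero.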

## Basics of Selecting Rows and Columns

In [29]:
Auto = Auto_new
Auto.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

In [30]:
Auto[:3]

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite


In [32]:
Auto.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [33]:
idx_80 = Auto['year'] > 80
Auto[idx_80]

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
338,27.2,4,135.0,84.0,2490.0,15.7,81,1,plymouth reliant
339,26.6,4,151.0,84.0,2635.0,16.4,81,1,buick skylark
340,25.8,4,156.0,92.0,2620.0,14.4,81,1,dodge aries wagon (sw)
341,23.5,6,173.0,110.0,2725.0,12.6,81,1,chevrolet citation
342,30.0,4,135.0,84.0,2385.0,12.9,81,1,plymouth reliant
343,39.1,4,79.0,58.0,1755.0,16.9,81,3,toyota starlet
344,39.0,4,86.0,64.0,1875.0,16.4,81,1,plymouth champ
345,35.1,4,81.0,60.0,1760.0,16.1,81,3,honda civic 1300
346,32.3,4,97.0,67.0,2065.0,17.8,81,3,subaru
347,37.0,4,85.0,65.0,1975.0,19.4,81,3,datsun 210 mpg


In [34]:
Auto[['mpg', 'horsepower']]

,mpg,horsepower
0,18.0,130.0
1,15.0,165.0
2,18.0,150.0
3,16.0,150.0
4,17.0,140.0
...,...,...
392,27.0,86.0
393,44.0,52.0
394,32.0,84.0
395,28.0,79.0


In [35]:
Auto.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       387, 388, 389, 390, 391, 392, 393, 394, 395, 396],
      dtype='int64', length=392)

Change the index of the data frame from row numbers to the `name` column, which is already present in the data frame. Index labels should ideally be unique, so check this first:

In [44]:
Auto['name'].is_unique

False

In [55]:
Auto['name'].unique().shape

(301,)
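
The two checks above say that duplicates exist and how many distinct names there are, but not which entries repeat. A short sketch of how the repeated entries could be inspected is shown here; the Series `names` holds a few illustrative values rather than the full `name` column.

```python
import pandas as pd

# A few illustrative car names, one of which occurs three times.
names = pd.Series(['ford pinto', 'amc gremlin', 'ford pinto', 'subaru', 'ford pinto'])

print(names.is_unique)   # whether every value occurs exactly once
print(names.nunique())   # number of distinct values

# keep=False marks every occurrence of a repeated value, not only the later ones.
dup_mask = names.duplicated(keep=False)
print(names[dup_mask])
```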

In [36]:
Auto_re = Auto.set_index('name')
Auto_re

name,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
chevrolet chevelle malibu,18.0,8,307.0,130.0,3504.0,12.0,70,1
buick skylark 320,15.0,8,350.0,165.0,3693.0,11.5,70,1
plymouth satellite,18.0,8,318.0,150.0,3436.0,11.0,70,1
amc rebel sst,16.0,8,304.0,150.0,3433.0,12.0,70,1
ford torino,17.0,8,302.0,140.0,3449.0,10.5,70,1
...,...,...,...,...,...,...,...,...
ford mustang gl,27.0,4,140.0,86.0,2790.0,15.6,82,1
vw pickup,44.0,4,97.0,52.0,2130.0,24.6,82,2
dodge rampage,32.0,4,135.0,84.0,2295.0,11.6,82,1
ford ranger,28.0,4,120.0,79.0,2625.0,18.6,82,1


In [37]:
Auto

,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
392,27.0,4,140.0,86.0,2790.0,15.6,82,1,ford mustang gl
393,44.0,4,97.0,52.0,2130.0,24.6,82,2,vw pickup
394,32.0,4,135.0,84.0,2295.0,11.6,82,1,dodge rampage
395,28.0,4,120.0,79.0,2625.0,18.6,82,1,ford ranger


In [38]:
Auto_re.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin'],
      dtype='object')

In [39]:
Auto.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

In [56]:
rows = ['amc rebel sst', 'ford torino']
Auto_re.loc[rows]

name,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
amc rebel sst,16.0,8,304.0,150.0,3433.0,12.0,70,1
ford torino,17.0,8,302.0,140.0,3449.0,10.5,70,1


Retrieve specific **rows** by position using the `iloc[]` indexer (here the 4th and 5th rows):

In [59]:
Auto_re.iloc[[3,4]]

name,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin
amc rebel sst,16.0,8,304.0,150.0,3433.0,12.0,70,1
ford torino,17.0,8,302.0,140.0,3449.0,10.5,70,1


Retrieve all rows, but only selected **columns** (here the 1st, 3rd and 4th):

In [60]:
Auto_re.iloc[:,[0,2,3]]

name,mpg,displacement,horsepower
chevrolet chevelle malibu,18.0,307.0,130.0
buick skylark 320,15.0,350.0,165.0
plymouth satellite,18.0,318.0,150.0
amc rebel sst,16.0,304.0,150.0
ford torino,17.0,302.0,140.0
...,...,...,...
ford mustang gl,27.0,140.0,86.0
vw pickup,44.0,97.0,52.0
dodge rampage,32.0,135.0,84.0
ford ranger,28.0,120.0,79.0


Syntax: `<data_frame>.iloc[[row_positions], [column_positions]]`, where each list names the zero-based positions to keep (it is not a start and an end):

In [61]:
Auto_re.iloc[[3,4],[0,2,3]]

name,mpg,displacement,horsepower
amc rebel sst,16.0,304.0,150.0
ford torino,17.0,302.0,140.0
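
Lists and slices behave differently inside `iloc[]`, and slices behave differently again in `loc[]`. The following sketch contrasts the three on a five-row frame built from the first rows shown earlier (three columns only, to keep it short).

```python
import pandas as pd

df = pd.DataFrame(
    {'mpg': [18.0, 15.0, 18.0, 16.0, 17.0],
     'cylinders': [8, 8, 8, 8, 8],
     'displacement': [307.0, 350.0, 318.0, 304.0, 302.0]},
    index=['chevrolet chevelle malibu', 'buick skylark 320',
           'plymouth satellite', 'amc rebel sst', 'ford torino'],
)

# A list picks exactly the listed positions: rows 3 and 4, columns 0 and 2.
by_list = df.iloc[[3, 4], [0, 2]]

# A positional slice excludes its end: rows 1, 2, 3 and columns 0, 1.
by_slice = df.iloc[1:4, 0:2]

# A label slice with loc includes both end points.
by_label = df.loc['buick skylark 320':'amc rebel sst', 'mpg':'cylinders']

print(by_list)
print(by_slice.equals(by_label))
```

Here the positional slice and the label slice select the same block, even though the positional one stops before position 4 and the label one includes its last label.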


Index entries should be unique, which is not the case here (check with `is_unique` before calling `set_index()`). With duplicate labels, `loc[]` returns every matching row:

In [62]:
Auto_re.loc['ford galaxie 500', ['mpg', 'origin']]

name,mpg,origin
ford galaxie 500,15.0,1
ford galaxie 500,14.0,1
ford galaxie 500,14.0,1


## More on Selecting Rows and Columns

Instead of using a boolean Series to select rows matching a condition (as shown below)...

In [63]:
idx_80 = Auto_re['year'] > 80
Auto_re.loc[idx_80, ['weight', 'origin']]

name,weight,origin
plymouth reliant,2490.0,1
buick skylark,2635.0,1
dodge aries wagon (sw),2620.0,1
chevrolet citation,2725.0,1
plymouth reliant,2385.0,1
toyota starlet,1755.0,3
plymouth champ,1875.0,1
honda civic 1300,1760.0,3
subaru,2065.0,3
datsun 210 mpg,1975.0,3


In [64]:
same_name = Auto['name'].value_counts() > 1
same_name = same_name[same_name].index
same_name

Index(['amc matador', 'ford pinto', 'toyota corolla', 'toyota corona',
       'amc hornet', 'chevrolet chevette', 'chevrolet impala', 'amc gremlin',
       'peugeot 504', 'ford maverick', 'ford gran torino', 'honda civic',
       'chevrolet caprice classic', 'dodge colt', 'volkswagen dasher',
       'plymouth duster', 'chevrolet citation', 'chevrolet nova',
       'pontiac catalina', 'plymouth fury iii', 'ford galaxie 500',
       'chevrolet vega', 'buick century', 'volkswagen rabbit',
       'amc matador (sw)', 'honda civic cvcc', 'ford gran torino (sw)',
       'plymouth reliant', 'honda accord', 'saab 99le',
       'chevrolet chevelle malibu', 'mazda 626', 'chevrolet malibu', 'subaru',
       'ford ltd', 'vw rabbit', 'datsun 710', 'plymouth valiant',
       'pontiac phoenix', 'chevrolet chevelle malibu classic', 'datsun 210',
       'fiat 128', 'opel manta', 'audi 100ls', 'toyota corolla 1200',
       'toyota mark ii', 'ford country squire (sw)', 'buick skylark',
       'opel 1900',

... the same selection can be achieved using a `lambda` function within the `loc[]` indexer

In [65]:
Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']]

name,weight,origin
plymouth reliant,2490.0,1
buick skylark,2635.0,1
dodge aries wagon (sw),2620.0,1
chevrolet citation,2725.0,1
plymouth reliant,2385.0,1
toyota starlet,1755.0,3
plymouth champ,1875.0,1
honda civic 1300,1760.0,3
subaru,2065.0,3
datsun 210 mpg,1975.0,3
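
One situation where the `lambda` form is convenient is method chaining: the intermediate frame has no variable name, so a precomputed boolean mask cannot refer to it. The sketch below shows this under that assumption, using three rows taken from the outputs above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['plymouth reliant', 'toyota starlet', 'ford torino'],
    'year': [81, 81, 70],
    'weight': [2490.0, 1755.0, 3449.0],
    'origin': [1, 3, 1],
})

result = (
    df.set_index('name')                                    # intermediate frame, never named
      .loc[lambda d: d['year'] > 80, ['weight', 'origin']]  # d is that intermediate frame
)

print(result)
```

The callable is evaluated on the frame that `loc[]` is applied to, so the condition always refers to the current state of the chain.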


Using `lambda` calls with the element-wise **and** operation (`&`)

In [66]:
Auto_re.loc[lambda df: (df['year'] > 80) & (df['mpg'] > 30), ['weight', 'origin']]

name,weight,origin
toyota starlet,1755.0,3
plymouth champ,1875.0,1
honda civic 1300,1760.0,3
subaru,2065.0,3
datsun 210 mpg,1975.0,3
toyota tercel,2050.0,3
mazda glc 4,1985.0,3
plymouth horizon 4,2215.0,1
ford escort 4w,2045.0,1
volkswagen jetta,2190.0,2


Using `lambda` calls with the element-wise **or** operation (`|`)

In [67]:
Auto_re.loc[lambda df: (df['year'] > 80) | (df['mpg'] > 30), ['weight', 'origin']]

name,weight,origin
toyota corolla 1200,1773.0,3
datsun 1200,1613.0,3
datsun b210,1950.0,3
toyota corolla 1200,1836.0,3
toyota corona,1649.0,3
...,...,...
ford mustang gl,2790.0,1
vw pickup,2130.0,2
dodge rampage,2295.0,1
ford ranger,2625.0,1
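
The parentheses around each comparison in the cells above are required: `&` and `|` bind more tightly than `>`, so without them Python would evaluate `80 & df['mpg']` first. The sketch below, on a made-up four-row frame, shows the two operators next to the negation `~` and an equivalent `query()` call.

```python
import pandas as pd

# Illustrative values only.
df = pd.DataFrame({'year': [70, 81, 82, 75],
                   'mpg': [18.0, 39.1, 28.0, 31.0]})

both = df[(df['year'] > 80) & (df['mpg'] > 30)]        # both conditions hold
either = df[(df['year'] > 80) | (df['mpg'] > 30)]      # at least one holds
neither = df[~((df['year'] > 80) | (df['mpg'] > 30))]  # ~ negates the combined mask

# query() accepts the same condition as a string, using plain and/or.
same_as_both = df.query('year > 80 and mpg > 30')

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```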
