# Fixing `cyl` Data Type
- 2008: extract int from string
- 2018: convert float to int

Load datasets `data_08_v2.csv` and `data_18_v2.csv`. You should've created these data files in the previous section: *Filter, Drop Nulls, Dedupe*.

In [159]:
# import pandas and the regex module
import pandas as pd
import re

In [160]:
df_08 = pd.read_csv('data_08_v2.csv')
df_18 = pd.read_csv('data_18_v2.csv')

In [161]:
# check value counts for the 2008 cyl column
df_08['cyl'].value_counts()

(6 cyl)     409
(4 cyl)     283
(8 cyl)     199
(5 cyl)      48
(12 cyl)     30
(10 cyl)     14
(2 cyl)       2
(16 cyl)      1
Name: cyl, dtype: int64

Read [this](https://stackoverflow.com/questions/35376387/extract-int-from-string-in-pandas) to help you extract ints from strings in Pandas for the next step.
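As a minimal sketch of the idea, the snippet below pulls the digits out of a few strings shaped like the 2008 `cyl` values using `Series.str.extract`. The toy Series `cyl` is an assumption made up for illustration, not the real dataset.

```python
import pandas as pd

# Toy values mimicking the 2008 'cyl' format (illustrative only, not the real data)
cyl = pd.Series(['(6 cyl)', '(4 cyl)', '(12 cyl)'])

# The capture group (\d+) grabs the first run of digits in each string.
# expand=False returns a Series instead of a one-column DataFrame.
cyl_int = cyl.str.extract(r'(\d+)', expand=False).astype(int)

print(cyl_int.tolist())  # [6, 4, 12]
```

Because the pattern matches any run of digits, it handles one-digit and two-digit cylinder counts without relying on character positions.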

In [162]:
# Extract int from strings in the 2008 cyl column
# Method 1: slice the string with apply and a lambda
# (fragile: relies on the digits sitting at fixed character positions)
df_08['cyl'].apply(lambda x: int(x[1:3])).value_counts()

6     409
4     283
8     199
5      48
12     30
10     14
2       2
16      1
Name: cyl, dtype: int64

In [163]:
# Method 2: use a regex to pull out the digits (preview only; nothing is assigned yet)
df_08['cyl'].apply(lambda x: re.search(r'\d+', x).group()).value_counts()

6     409
4     283
8     199
5      48
12     30
10     14
2       2
16      1
Name: cyl, dtype: int64

In [164]:
# Extract int from strings in the 2008 cyl column and assign the result
# Alternative: df_08['cyl'].str.extract(r'(\d+)', expand=False).astype(int)

# Chosen approach: regex search, then cast the matched digits to int
df_08['cyl'] = df_08['cyl'].apply(lambda x: int(re.search(r'\d+', x).group()))
df_08['cyl'].value_counts()

6     409
4     283
8     199
5      48
12     30
10     14
2       2
16      1
Name: cyl, dtype: int64

In [165]:
# Check the dtype of the 2008 cyl column to confirm the change
df_08['cyl'].dtype

dtype('int64')

In [166]:
df_18['cyl'].dtype

dtype('float64')

In [167]:
# convert 2018 cyl column to int
df_18['cyl'] = df_18['cyl'].astype(int)

In [168]:
print(df_18['cyl'].dtype)
print(df_08['cyl'].dtype)

int64
int64
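The float-to-int step can also be sketched on toy data. The Series below are assumptions for illustration. The second one shows what could be done if a column still contained missing values, which should not be the case here since nulls were dropped in the previous section.

```python
import pandas as pd

# Toy float values like the 2018 'cyl' column (illustrative only)
cyl = pd.Series([4.0, 6.0, 8.0])

# astype(int) casts the whole column at once
cyl_int = cyl.astype(int)
print(cyl_int.tolist())  # [4, 6, 8]

# If NaN were present, a plain int cast would raise an error.
# pandas' nullable 'Int64' dtype is one way to keep integers alongside missing values.
cyl_missing = pd.Series([4.0, None, 8.0])
cyl_nullable = cyl_missing.astype('Int64')
print(cyl_nullable.isna().sum())  # 1
```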


In [169]:
df_08.to_csv('data_08_v3.csv', index=False)
df_18.to_csv('data_18_v3.csv', index=False)
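One optional sanity check is to confirm that an integer column survives the CSV round trip. The sketch below uses a hypothetical two-row frame and an in-memory buffer rather than the real `data_08_v3.csv` file, so the names and values are assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for a cleaned dataset
df = pd.DataFrame({'model': ['A', 'B'], 'cyl': [4, 6]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# index=False keeps the row index out of the file, so no extra column appears
print(list(reloaded.columns))  # ['model', 'cyl']
print(pd.api.types.is_integer_dtype(reloaded['cyl']))  # True
```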