<h1 align=center id="data_acquisition"> Data types</h1>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    
<p>
Data types are one of those things you tend not to care about (as long as the DataFrame is built correctly) until you get an error or unexpected results. They are also one of the first things you should check once you load new data into pandas for further analysis.

In [1]:
import pandas as pd

In [12]:
df = pd.read_excel('datos_auto2.xlsx')

In [13]:
df.head()

Unnamed: 0.1,Unnamed: 0,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,0,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,0.890278,...,9.0,111.0,5000,21,27,13495,1.119048e+16,Medium,0,1
1,1,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,0.890278,...,9.0,111.0,5000,21,27,16500,1.119048e+16,Medium,0,1
2,2,alfa-romero,std,two,hatchback,rwd,front,94.5,0.822681,0.909722,...,9.0,154.0,5000,19,26,16500,1236842000000000.0,Medium,0,1
3,3,audi,std,four,sedan,fwd,front,99.8,0.84863,0.919444,...,10.0,102.0,5500,24,30,13950,9791667000000000.0,Medium,0,1
4,4,audi,std,four,sedan,4wd,front,99.4,0.84863,0.922222,...,8.0,115.0,5500,18,22,17450,1.305556e+16,Medium,0,1


In [4]:
df.dtypes

Unnamed: 0             int64
make                  object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm               int64
city-mpg               int64
highway-mpg            int64
price                  int64
city-L/100km         float64
horsepower-binned     object
diesel                 int64
gas                    int64
dtype: object

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         201 non-null    int64  
 1   make               201 non-null    object 
 2   aspiration         201 non-null    object 
 3   num-of-doors       198 non-null    object 
 4   body-style         201 non-null    object 
 5   drive-wheels       201 non-null    object 
 6   engine-location    201 non-null    object 
 7   wheel-base         198 non-null    float64
 8   length             201 non-null    float64
 9   width              201 non-null    float64
 10  height             201 non-null    float64
 11  curb-weight        201 non-null    int64  
 12  engine-type        201 non-null    object 
 13  num-of-cylinders   201 non-null    object 
 14  engine-size        201 non-null    int64  
 15  fuel-system        201 non-null    object 
 16  bore               201 non

<div class="alert alert-block alert-info" style="margin-top: 20px">
    pandas works mainly with five data types:
<ul>
    <li> <b>int64</b>, Integer numbers</li>
    <li> <b>float64</b>, Floating point numbers </li>
    <li> <b>object</b>, Text or mixed numeric and non-numeric values</li>
    <li> <b>bool</b>, True/False values</li>
    <li> <b>datetime64</b>, Date and time values </li>
</ul>
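
As a rough illustration of the five types listed above, the sketch below builds a small DataFrame with one column per type and inspects <i>dtypes</i>. The frame <i>demo</i> and its values are hypothetical and are not taken from the car dataset.

```python
import pandas as pd

# Hypothetical toy data: one column for each dtype in the list above
demo = pd.DataFrame({
    "engine_size": [130, 152, 109],          # integers -> int64
    "length": [168.8, 171.2, 176.6],         # decimals -> float64
    "doors": ["two", "four", 4],             # mixed text and numbers -> object
    "is_diesel": [False, False, True],       # True/False -> bool
    "sold_on": pd.to_datetime(
        ["2021-01-05", "2021-02-11", "2021-03-20"]
    ),                                       # dates -> datetime64 (unit suffix may vary)
})

# One dtype per column, in the same spirit as df.dtypes on the car data
print(demo.dtypes)
```

The datetime column has to be converted explicitly here; plain date strings would otherwise be loaded as text.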

In [6]:
df['city-mpg'].dtype

dtype('int64')

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <b>astype()</b>, changes the data type of a column (NOTE: it returns a new object rather than modifying the column in place, so you must assign the result back to the column)

In [None]:
# df['city-mpg'].astype('float64')
# df['city-mpg'] = df['city-mpg'].astype('float64')
df['city-mpg'].dtype
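
A minimal sketch of why the assignment matters: <i>astype()</i> returns a new Series, so the column only changes type once the result is assigned back. The frame <i>cars</i> is a hypothetical stand-in for the car dataset.

```python
import pandas as pd

# Hypothetical stand-in for the 'city-mpg' column of the car data
cars = pd.DataFrame({"city-mpg": [21, 21, 19, 24, 18]})

# astype() returns a new Series; the column in `cars` is untouched
converted = cars["city-mpg"].astype("float64")
dtype_before_assign = cars["city-mpg"].dtype

# Assigning the result back is what actually changes the column
cars["city-mpg"] = converted
dtype_after_assign = cars["city-mpg"].dtype

print(dtype_before_assign, dtype_after_assign)
```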

In [15]:
# A type ‘O’ just stands for “object” 
df['body-style'].dtype

dtype('O')

<div class="alert alert-block alert-info" style="margin-top: 20px">

<h2> Missing Data</h2>
<ul>
    <li> <b>NaN</b>, Not a Number</li>
    <li> <b>NaT</b>, Not a Time </li>
</ul>
<p>
    <h4> NA handling methods:</h4>
<ul>
    <li> <b>isnull()</b>, returns a like-typed object of boolean values indicating which values are missing (NA)</li>
    <li> <b>notnull()</b>, the negation of <i>isnull</i> </li>
    <li> <b>dropna()</b>, filters axis labels based on whether the values for each label contain missing data </li>
    <li> <b>fillna()</b>, fills in missing data with a constant value, or with the previous value (<i>method='ffill'</i>) or the next value (<i>method='bfill'</i>); recent pandas versions deprecate the <i>method</i> argument in favor of <b>ffill()</b> and <b>bfill()</b> </li>
</ul>    
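
The four methods above can be sketched on a small Series. The Series <i>doors</i> and the fill value <i>"unknown"</i> are hypothetical choices made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries
doors = pd.Series(["two", np.nan, "four", np.nan, "four"])

missing_mask = doors.isnull()        # True where a value is missing
present = doors[doors.notnull()]     # boolean filtering on the negated mask
dropped = doors.dropna()             # same rows as the filter above
filled = doors.fillna("unknown")     # replace missing values with a constant

print(missing_mask.tolist())
print(filled.tolist())
```

None of these calls modifies <i>doors</i>; each one returns a new object.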

In [22]:
type(df.loc[9,'num-of-doors'])

float

In [23]:
serie_doors = df['num-of-doors']
serie_doors

0       two
1       two
2       two
3      four
4      four
       ... 
196    four
197    four
198    four
199    four
200    four
Name: num-of-doors, Length: 201, dtype: object

In [26]:
serie_doors[9]

nan

In [27]:
serie_doors[serie_doors.notnull()]

0       two
1       two
2       two
3      four
4      four
       ... 
196    four
197    four
198    four
199    four
200    four
Name: num-of-doors, Length: 198, dtype: object

<div class="alert alert-block alert-info" style="margin-top: 20px">

With DataFrame objects, you may want to drop rows or columns that are all NA, or just those containing any NAs.

In [29]:
import numpy as np

In [32]:
df2 = pd.DataFrame([[1.0, 2.4, 0.35],[0.5,np.nan, 2.7],[np.nan, np.nan, np.nan],[np.nan, 6.5, 2.1]])
df2

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,,2.7
2,,,
3,,6.5,2.1


In [33]:
df2.dropna()

Unnamed: 0,0,1,2
0,1.0,2.4,0.35


In [35]:
df2.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,,2.7
3,,6.5,2.1


Columns can be dropped in the same way by passing <i>axis=1</i> or <i>axis='columns'</i>. In this example no column is entirely NA, so all columns are kept.

In [37]:
df2.dropna(how='all', axis='columns')

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,,2.7
2,,,
3,,6.5,2.1


Keeping only the rows/columns that have at least <i>n</i> non-NA observations: <b>.dropna</b>(<i>thresh=n</i>)

In [39]:
df2.dropna(axis=1, thresh=3)

Unnamed: 0,2
0,0.35
1,2.7
2,
3,2.1
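
To see what <i>thresh</i> is comparing against, it can help to count the non-NA values first. The sketch below rebuilds the same values as <i>df2</i> under the hypothetical name <i>frame</i>, so it runs on its own, and applies the threshold to rows instead of columns.

```python
import numpy as np
import pandas as pd

# Same values as df2, rebuilt so the snippet is self-contained
frame = pd.DataFrame([[1.0, 2.4, 0.35],
                      [0.5, np.nan, 2.7],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 2.1]])

# Number of non-NA observations in each column; thresh is compared to these
non_null_per_column = frame.notnull().sum()

# Row-wise version (default axis=0): keep rows with at least 2 non-NA values
rows_kept = frame.dropna(thresh=2)

print(non_null_per_column.tolist())
print(rows_kept.index.tolist())
```

Only the last column has three observations, which is why it is the only one that survives <i>thresh=3</i> with <i>axis=1</i>.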


<div class="alert alert-block alert-info" style="margin-top: 20px">

<h3>Filling in Missing Data </h3>
    
Rather than filtering out missing data, you may want to fill in the "holes" with a constant value, with the previous value ('ffill'), or with the next value ('bfill'). By default, the previous/next value is taken along each column (<i>axis=0</i>).

In [43]:
df2b = df2.copy()

In [40]:
df2.fillna(0)

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,0.0,2.7
2,0.0,0.0,0.0
3,0.0,6.5,2.1


In [41]:
df2

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,,2.7
2,,,
3,,6.5,2.1


In [44]:
df2b.fillna(0, inplace=True)
df2b

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,0.0,2.7
2,0.0,0.0,0.0
3,0.0,6.5,2.1


In [46]:
# Limit the number of consecutive replacements in each gap
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent)
df2.ffill(axis=0, limit=1)

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,2.4,2.7
2,0.5,,2.7
3,,6.5,2.1
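
For comparison, here is a sketch of the two other fill directions: taking the next value down each column, and taking the previous value across each row. It rebuilds the same values as <i>df2</i> under the hypothetical name <i>frame</i> and uses the <i>bfill()</i> and <i>ffill()</i> methods.

```python
import numpy as np
import pandas as pd

# Same values as df2, rebuilt so the snippet is self-contained
frame = pd.DataFrame([[1.0, 2.4, 0.35],
                      [0.5, np.nan, 2.7],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 2.1]])

# Backward fill: each gap takes the next valid value further down its column
filled_back = frame.bfill()

# Forward fill along rows: each gap takes the previous value to its left
filled_across = frame.ffill(axis=1)

print(filled_back)
print(filled_across)
```

Gaps with no valid neighbor in the fill direction stay NA, for example the trailing gaps in the first column under <i>bfill()</i>.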


In [52]:
df2.fillna(df2.mean())

Unnamed: 0,0,1,2
0,1.0,2.4,0.35
1,0.5,4.45,2.7
2,0.75,4.45,1.716667
3,0.75,6.5,2.1


<div class="alert alert-block alert-info" style="margin-top: 20px">

<h3> Data manipulation </h3>
<p>


In [None]:
df.describe(include="all")