<h2 align="center">Machine Learning: Car Price Prediction</h2>

#### Aspiring Junior Data Scientist: Karina Gonçalves Soares

#### Study links:

* [Machine learning project](https://medium.com/@furkankizilay/end-to-end-machine-learning-project-using-fastapi-streamlit-and-docker-6fda32d25c5d)
* [Repository](https://github.com/EddyGiusepe/FastAPI/blob/main/4_End-to-End_ML_FastAPI_Streamlit_Docker/Machine_Learning_to_Cars.ipynb)

#### In this article, we develop an end-to-end machine learning project in 11 steps on our local machine. We then build an API with FastAPI, create an interface with Streamlit, and finally dockerize the project.

#### 1. Data loading
#### 2. Feature engineering
#### 3. Data cleaning
#### 4. Outlier removal
#### 5. Data visualization (relationships between variables)
#### 6. Extracting the training data
#### 7. Encoding
#### 8. Building a model
#### 9. Creating an API with FastAPI
#### 10. Creating a web interface for the model with Streamlit
#### 11. Dockerizing


In [1]:
#%pip install pandas 

In [2]:
import pandas as pd

In [3]:
# Open the file for reading

df = pd.read_csv('cars.csv')
df.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford


## Feature engineering
#### Add a new feature for the company name.

In [4]:
df["company"] = df.name.apply(lambda x: x.split(" ")[0])
df.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford
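
The same feature can also be derived with pandas' vectorized string methods instead of `apply`. The sketch below is only an illustration: the `sample` frame and its rows are made up, and it assumes the company is always the first word of `name`.

```python
import pandas as pd

# Illustrative rows only; the real data comes from cars.csv.
sample = pd.DataFrame(
    {"name": ["Hyundai Santro Xing", "Ford Figo", "Maruti Suzuki Alto"]}
)

# Split each name on whitespace and keep the first token as the company.
sample["company"] = sample["name"].str.split().str[0]

print(sample["company"].tolist())
```

Multi-word brands would be cut to their first word under this assumption, so the result is worth a quick check with `unique()`.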


## Data cleaning
#### The year column has many non-year values.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        816 non-null    object
 1   year        816 non-null    int64 
 2   Price       816 non-null    int64 
 3   kms_driven  816 non-null    int64 
 4   fuel_type   816 non-null    object
 5   company     816 non-null    object
dtypes: int64(3), object(3)
memory usage: 38.4+ KB


In [6]:
# Draw a random sample of rows to inspect the DataFrame.

visualizando_valores_aleatorios = df.sample(n=10, replace=False)
print(visualizando_valores_aleatorios)

                       name  year   Price  kms_driven fuel_type   company
494         Honda Amaze 1.2  2014  381000        6000    Petrol     Honda
584   Skoda Octavia Classic  2006  114990       65000    Diesel     Skoda
590    Mahindra Scorpio 2.6  2007  260000       56000    Diesel  Mahindra
454            Tata Zest XE  2018  480000      103553    Diesel      Tata
297             Honda Amaze  2014  299999       37000    Petrol     Honda
121      Maruti Suzuki Alto  2017  350000        5600    Petrol    Maruti
720  Ford EcoSport Titanium  2014  590000       34000    Diesel      Ford
467    Maruti Suzuki Vitara  2017  725000       36000    Diesel    Maruti
351    Renault Duster 110PS  2012  501000       35000    Diesel   Renault
207    Renault Duster 110PS  2012  501000       35000    Diesel   Renault


In [7]:
# We observe that the "year" Series has values that are not years. Let's remove them:

df['year'].value_counts()


year
2015    111
2013     94
2014     92
2012     75
2016     74
2011     59
2009     54
2017     53
2010     43
2018     30
2006     22
2007     19
2019     18
2008     16
2005     13
2003     13
2004     12
2000      7
2001      5
2002      4
1995      2
Name: count, dtype: int64

In [8]:
# We can also check it this way:
df['year'].unique()

array([2007, 2006, 2014, 2012, 2013, 2016, 2015, 2010, 2017, 2008, 2018,
       2011, 2019, 2009, 2005, 2000, 2003, 2004, 1995, 2002, 2001])

In [9]:
print('Checking the type: ', type(df['year'].iloc[0]))

Checking the type:  <class 'numpy.int64'>


In [10]:
# Work on a copy:
df2 = df.copy()

In [11]:
# Build a new DataFrame with only the desired rows.
# str.isnumeric() is True when every character of the string is numeric.
# Cast to str first: the .str accessor raises AttributeError on an integer column.

df2 = df2[df2["year"].astype(str).str.isnumeric()]
df2

In [None]:
# Convert the year column to integer:

df2['year'] = df2['year'].astype(int)

In [None]:
# Now every value is a number:

print("Let's look again: ", df2["year"].unique())

Let's look again:  [2007 2006 2018 2014 2015 2012 2013 2016 2010 2017 2008 2011 2019 2009
 2005 2000 2003 2004 1995 2002 2001]
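
When a column such as `year` may mix real years with stray text, one robust option is `pd.to_numeric` with `errors="coerce"`, which works whether the column is stored as strings or as integers. This is a minimal sketch on made-up values (`raw` is a hypothetical frame), not the notebook's own method.

```python
import pandas as pd

# Hypothetical raw column mixing valid years with non-year text.
raw = pd.DataFrame({"year": ["2007", "2014", "sale", "2012", "/-Rs"]})

# Non-numeric entries become NaN instead of raising an error.
years = pd.to_numeric(raw["year"], errors="coerce")

# Keep only the rows that parsed, then store the column as integers.
clean = raw[years.notna()].copy()
clean["year"] = years[years.notna()].astype(int)

print(clean["year"].tolist())
```

Taking an explicit `.copy()` of the filtered frame keeps the later assignment from touching a view of `raw`.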


In [None]:
# Remove the rows whose "Price" value is the text "Ask For Price":

df2 = df2[df2["Price"] != "Ask For Price"]
df2.Price

0        80,000
1      4,25,000
3      3,25,000
4      5,75,000
6      1,75,000
         ...   
837    3,00,000
838    2,60,000
839    3,90,000
840    1,80,000
841    1,60,000
Name: Price, Length: 819, dtype: object

In [None]:
# Reset the index:

df2 = df2.reset_index(drop=True)
df2.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel,Ford
4,Ford Figo,2012,175000,"41,000 kms",Diesel,Ford


In [None]:
# Inspect the result with .info():
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 819 entries, 0 to 818
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        819 non-null    object
 1   year        819 non-null    int64 
 2   Price       819 non-null    object
 3   kms_driven  819 non-null    object
 4   fuel_type   816 non-null    object
 5   company     819 non-null    object
dtypes: int64(1), object(5)
memory usage: 38.5+ KB


In [None]:
# The Price column has commas in its prices and is stored as object:
df2.Price = df2.Price.str.replace(",", "").astype(int)  # finally, convert to integer

df2.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel,Ford
4,Ford Figo,2012,175000,"41,000 kms",Diesel,Ford


In [None]:
# Check the type of the Price column
type(df2["Price"].iloc[0])

numpy.int64
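
The two Price steps (dropping the "Ask For Price" rows, then stripping the separators) can be sketched together on a small made-up frame. The `prices` name and its rows are illustrative; only the string format mirrors the raw column.

```python
import pandas as pd

# Made-up rows in the same raw format as the Price column.
prices = pd.DataFrame({"Price": ["80,000", "Ask For Price", "4,25,000"]})

# Drop the placeholder rows first; they cannot be converted to a number.
prices = prices[prices["Price"] != "Ask For Price"].copy()

# Remove the separators (regex=False treats "," as a literal) and cast to int.
prices["Price"] = prices["Price"].str.replace(",", "", regex=False).astype(int)

print(prices["Price"].tolist())
```

The order matters: calling `astype(int)` before the placeholder rows are removed would raise a `ValueError`.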

In [None]:
# The "kms_driven" column holds object (string) values with "kms" at the end:
print(df2[df2["kms_driven"].isna()])

Empty DataFrame
Columns: [name, year, Price, kms_driven, fuel_type, company]
Index: []


In [None]:
# As we can see, there are no NaN values, but two rows contain the value 'Petrol':
df2["kms_driven"].isnull().sum()

0

In [None]:
# This also confirms that there are no missing (NaN) values:
df2[df2["kms_driven"].isna()]

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company


In [None]:
# Keep only the FIRST token of each string (get(0)), then remove the commas:
df2["kms_driven"] = df2["kms_driven"].str.split(" ").str.get(0).str.replace(',', "")

In [None]:
# Compare df2 with its earlier state to see the effect of the cleaning:
df2.head()


Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford


In [None]:
type(df2['kms_driven'].iloc[0])

str

In [None]:
# Keep only the rows whose value is numeric, so the column can then be converted:
df2 = df2[df2["kms_driven"].str.isnumeric()]

In [None]:
pd.options.mode.chained_assignment = None  # disable the chained-assignment warning

In [None]:
df2["kms_driven"] = df2["kms_driven"].astype(int) 

In [None]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 817 entries, 0 to 816
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        817 non-null    object
 1   year        817 non-null    int64 
 2   Price       817 non-null    int64 
 3   kms_driven  817 non-null    int64 
 4   fuel_type   816 non-null    object
 5   company     817 non-null    object
dtypes: int64(3), object(3)
memory usage: 44.7+ KB
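
As an alternative to silencing the chained-assignment warning, the filtered frame can be copied explicitly before the column is reassigned. The sketch below runs the whole `kms_driven` cleanup on made-up rows; the `kms` and `token` names are illustrative.

```python
import pandas as pd

# Made-up rows: two valid distances and one stray fuel label.
kms = pd.DataFrame({"kms_driven": ["45,000 kms", "40 kms", "Petrol"]})

# Keep the first whitespace-separated token and drop the commas.
token = kms["kms_driven"].str.split().str[0].str.replace(",", "", regex=False)

# Filter to digit-only tokens; .copy() makes the result an independent frame.
numeric = token.str.isnumeric()
kms = kms[numeric].copy()
kms["kms_driven"] = token[numeric].astype(int)

print(kms["kms_driven"].tolist())
```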


In [None]:
# The 'fuel_type' column has NaN values:
print("The 'fuel_type' column has one NaN value: ")
df2[df2["fuel_type"].isna()]

The 'fuel_type' column has one NaN value: 


Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
128,Toyota Corolla,2009,275000,26000,,Toyota


In [None]:
# Keep only the rows without NaN:
df2 = df2[~df2["fuel_type"].isna()]
df2.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford
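
The same filter can be written with `dropna` and its `subset` parameter, which states the intent directly. A minimal sketch on made-up rows (the `cars` frame is hypothetical):

```python
import pandas as pd

# Made-up rows; the second one has no fuel type.
cars = pd.DataFrame(
    {
        "name": ["Ford Figo", "Toyota Corolla", "Honda Amaze"],
        "fuel_type": ["Diesel", None, "Petrol"],
    }
)

# Drop only the rows where fuel_type is missing; other columns are ignored.
cars = cars.dropna(subset=["fuel_type"])

print(cars["name"].tolist())
```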


In [None]:
# Shorten the car names, keeping only the first three words:
df2['name'] = df2['name'].str.split().str.slice(start=0, stop=3).str.join(" ")
df2.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford
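
One likely benefit of shortening the names is that trim-level variants collapse into a single model label, leaving fewer distinct categories. The sketch below shows that effect on made-up names; it uses `.str[:3]`, which behaves like `str.slice(start=0, stop=3)` here.

```python
import pandas as pd

# Made-up names: two trim levels of one model, plus a short name.
names = pd.Series(
    [
        "Hyundai Santro Xing XO eRLX Euro III",
        "Hyundai Santro Xing XS",
        "Ford Figo",
    ]
)

# Keep at most the first three words of each name.
short = names.str.split().str[:3].str.join(" ")

print(names.nunique(), "->", short.nunique())
```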


Resetting the index of the final cleaned data:

In [None]:
df2 = df2.reset_index(drop=True)

Saving our cleaned data:

In [None]:
df2.to_csv('cars.csv', index=False)
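
Because the cleaned frame is written back to `cars.csv`, a later run of the notebook reads data that is already clean. One option, sketched below, is to write to a separate file instead; the `cars_clean.csv` name, the temporary directory, and the `cleaned` frame are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up cleaned rows standing in for df2.
cleaned = pd.DataFrame({"name": ["Ford Figo"], "year": [2012], "Price": [175000]})

# Write to a separate, hypothetical file so the raw CSV stays untouched.
out_path = Path(tempfile.mkdtemp()) / "cars_clean.csv"
cleaned.to_csv(out_path, index=False)

# Read the file back to check that the numeric columns are still integers.
restored = pd.read_csv(out_path)
print(restored.dtypes)
```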

In [None]:
df2.describe(include="all")

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
count,816,816.0,816.0,816.0,816,816
unique,254,,,,3,25
top,Maruti Suzuki Swift,,,,Petrol,Maruti
freq,51,,,,428,221
mean,,2012.444853,411717.6,46275.531863,,
std,,4.002992,475184.4,34297.428044,,
min,,1995.0,30000.0,0.0,,
25%,,2010.0,175000.0,27000.0,,
50%,,2013.0,299999.0,41000.0,,
75%,,2015.0,491250.0,56818.5,,


## Outlier removal

We remove the price outliers:

6e6 is 6 million in scientific notation (6 times 10 to the power of 6).
We keep only the rows whose Price is below 6 million, dropping those at or above it, and generate a new index.

In [12]:
df2 = df2[df2['Price']<6e6].reset_index(drop=True)

df2

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford
...,...,...,...,...,...,...
810,Maruti Suzuki Ritz,2011,270000,50000,Petrol,Maruti
811,Tata Indica V2,2009,110000,30000,Diesel,Tata
812,Toyota Corolla Altis,2009,300000,132000,Petrol,Toyota
813,Tata Zest XM,2018,260000,27000,Diesel,Tata
