<h2 align="center">Machine Learning: Car Price Prediction</h2>

#### Aspiring Junior Data Scientist: Karina Gonçalves Soares

#### Study links:

* [Machine learning project](https://medium.com/@furkankizilay/end-to-end-machine-learning-project-using-fastapi-streamlit-and-docker-6fda32d25c5d)
* [Repository](https://github.com/EddyGiusepe/FastAPI/blob/main/4_End-to-End_ML_FastAPI_Streamlit_Docker/Machine_Learning_to_Cars.ipynb)

#### In this article, we develop an end-to-end machine learning project in 11 steps on our local machine. After building the model, we create an API with FastAPI, build an interface with Streamlit, and finally dockerize the project.

#### 1. Loading the data
#### 2. Feature engineering
#### 3. Data cleaning
#### 4. Removing outliers
#### 5. Data visualization (relationships between variables)
#### 6. Extracting the training data
#### 7. Encoding
#### 8. Building a model
#### 9. Creating an API with FastAPI
#### 10. Creating a web interface for the model with Streamlit
#### 11. Dockerizing


In [1]:
#%pip install pandas 

In [2]:
import pandas as pd

In [3]:
# Opening the file for reading

df = pd.read_csv('cars.csv')
df.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel


## Feature engineering
#### Add a new feature for the company name.

In [4]:
df["company"] = df.name.apply(lambda x: x.split(" ")[0])
df.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel,Mahindra
2,Maruti Suzuki Alto 800 Vxi,2018,Ask For Price,"22,000 kms",Petrol,Maruti
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol,Hyundai
4,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel,Ford
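
The same column can also be derived with pandas' vectorized string methods instead of `apply`. The sketch below is only an illustration: it uses a small hand-made DataFrame (the hypothetical name `toy`, with three names copied from the output above) rather than `cars.csv`.

```python
import pandas as pd

# Toy frame with three names taken from the head() output above.
toy = pd.DataFrame({
    "name": [
        "Hyundai Santro Xing XO eRLX Euro III",
        "Mahindra Jeep CL550 MDI",
        "Ford EcoSport Titanium 1.5L TDCi",
    ]
})

# Split on whitespace and keep the first token as the company name.
toy["company"] = toy["name"].str.split().str[0]

print(toy["company"].tolist())
```

Splitting on any whitespace (no argument to `split`) also tolerates names with repeated spaces, which `x.split(" ")` does not.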


## Data cleaning
#### The year column contains many values that are not years.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 892 entries, 0 to 891
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        892 non-null    object
 1   year        892 non-null    object
 2   Price       892 non-null    object
 3   kms_driven  840 non-null    object
 4   fuel_type   837 non-null    object
 5   company     892 non-null    object
dtypes: object(6)
memory usage: 41.9+ KB


In [6]:
# Drawing a random sample of rows to inspect the DataFrame.

visualizando_valores_aleatorios = df.sample(n=10, replace=False)
print(visualizando_valores_aleatorios)

                                            name  year          Price   
519              Maruti Suzuki Wagon R VXI BS IV  2017       3,75,000  \
364                     Mahindra Jeep MM 550 XDB  2019       3,90,000   
430                   Toyota Fortuner 3.0 4x4 MT  2010       9,40,000   
170                   Toyota Corolla Altis 1.8 J  2015       5,75,000   
304                           Tata Indica eV2 LS  2017  Ask For Price   
336        Renault Duster 110 PS RxZ Diesel Plus  2012       5,01,000   
303  Mahindra Scorpio VLX Special Edition BS III  2004       2,30,000   
760         Chevrolet Tavera LS B3 10 Seats BSII  2005       1,30,000   
419                      Hyundai i20 Magna O 1.2  2013       3,10,000   
819                          Tata Bolt XM Petrol  2015       6,00,000   

       kms_driven fuel_type    company  
519    23,000 kms    Petrol     Maruti  
364        60 kms    Diesel   Mahindra  
430  1,31,000 kms    Diesel     Toyota  
170    42,000 kms    Petrol     

In [7]:
# The "year" Series contains values that are not years. Let's remove them:

df['year'].value_counts()


year
2015    117
2014     94
2013     94
2016     76
2012     75
       ... 
ture      1
emi       1
able      1
no.       1
zest      1
Name: count, Length: 61, dtype: int64

In [8]:
# We can also check it this way:
df['year'].unique()

array(['2007', '2006', '2018', '2014', '2015', '2012', '2013', '2016',
       '2010', '2017', '2008', '2011', '2019', '2009', '2005', '2000',
       '...', '150k', 'TOUR', '2003', 'r 15', '2004', 'Zest', '/-Rs',
       'sale', '1995', 'ara)', '2002', 'SELL', '2001', 'tion', 'odel',
       '2 bs', 'arry', 'Eon', 'o...', 'ture', 'emi', 'car', 'able', 'no.',
       'd...', 'SALE', 'digo', 'sell', 'd Ex', 'n...', 'e...', 'D...',
       ', Ac', 'go .', 'k...', 'o c4', 'zire', 'cent', 'Sumo', 'cab',
       't xe', 'EV2', 'r...', 'zest'], dtype=object)

In [9]:
print('Checking the type: ', type(df['year'].iloc[0]))

Checking the type:  <class 'str'>


In [10]:
# We work on a copy:
df2 = df.copy()

In [11]:
# Creating a new DataFrame with only the desired rows.
# str.isnumeric() is True when every character in the string is numeric:


df2 = df2[df2["year"].str.isnumeric()] 
df2

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel,Mahindra
2,Maruti Suzuki Alto 800 Vxi,2018,Ask For Price,"22,000 kms",Petrol,Maruti
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol,Hyundai
4,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel,Ford
...,...,...,...,...,...,...
886,Toyota Corolla Altis,2009,300000,"1,32,000 kms",Petrol,Toyota
888,Tata Zest XM Diesel,2018,260000,"27,000 kms",Diesel,Tata
889,Mahindra Quanto C8,2013,390000,"40,000 kms",Diesel,Mahindra
890,Honda Amaze 1.2 E i VTEC,2014,180000,Petrol,,Honda


In [12]:
# Converting the year column to integer:

df2['year'] = df2['year'].astype(int)

In [13]:
# Now all values are numbers:

print("Let's look again: ", df2["year"].unique())

Let's look again:  [2007 2006 2018 2014 2015 2012 2013 2016 2010 2017 2008 2011 2019 2009
 2005 2000 2003 2004 1995 2002 2001]
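
To make the filtering step concrete, here is a minimal sketch on a hypothetical sample (`years`) that mixes valid years with stray text like the values listed by `unique()`. It shows the `str.isnumeric()` filter used above next to one possible alternative, `pd.to_numeric(..., errors="coerce")`; the alternative is an assumption of this example, not something the original notebook uses.

```python
import pandas as pd

# Hypothetical sample: valid years mixed with stray text.
years = pd.Series(["2007", "2014", "150k", "TOUR", "1995", "Zest"], name="year")

# Approach used in the notebook: keep only all-digit strings, then cast.
mask = years.str.isnumeric()
clean_isnumeric = years[mask].astype(int)

# Alternative: turn unparseable values into NaN, drop them, then cast.
clean_coerce = pd.to_numeric(years, errors="coerce").dropna().astype(int)

print(clean_isnumeric.tolist())
print(clean_coerce.tolist())
```

On this sample both approaches keep the same three years. They are not equivalent in general: `pd.to_numeric` also accepts strings such as `"1e3"` or `"20.5"`, which `str.isnumeric()` rejects.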


In [18]:
# Remove the rows whose "Price" value is the text "Ask For Price":

df2 = df2[df2["Price"] != "Ask For Price"]
df2.Price

0        80,000
1      4,25,000
3      3,25,000
4      5,75,000
6      1,75,000
         ...   
837    3,00,000
838    2,60,000
839    3,90,000
840    1,80,000
841    1,60,000
Name: Price, Length: 819, dtype: object

In [19]:
# Reset the index:

df2 = df2.reset_index(drop=True)
df2.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel,Ford
4,Ford Figo,2012,175000,"41,000 kms",Diesel,Ford


In [20]:
# Let's look at the .info() output:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 819 entries, 0 to 818
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        819 non-null    object
 1   year        819 non-null    int64 
 2   Price       819 non-null    object
 3   kms_driven  819 non-null    object
 4   fuel_type   816 non-null    object
 5   company     819 non-null    object
dtypes: int64(1), object(5)
memory usage: 38.5+ KB


In [21]:
# The Price column has commas in its values and is stored as object:
df2.Price = df2.Price.str.replace(",", "").astype(int)  # finally, convert to integer

df2.head()

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,"45,000 kms",Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40 kms,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,"28,000 kms",Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,"36,000 kms",Diesel,Ford
4,Ford Figo,2012,175000,"41,000 kms",Diesel,Ford


In [23]:
# Checking the type of the Price column
type(df2["Price"].iloc[0])

numpy.int64
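
The Price cleanup above can be summarized in one chained expression. This is a sketch on a hypothetical four-value Series (`prices`), not on the real file; it assumes, as in the outputs above, that the only non-numeric entry is the text `Ask For Price`.

```python
import pandas as pd

# Hypothetical sample in the same format as the Price column.
prices = pd.Series(["80,000", "4,25,000", "Ask For Price", "3,25,000"], name="Price")

# Drop the placeholder text, strip the comma separators, cast to int,
# and renumber the index so it has no gaps.
clean = (
    prices[prices != "Ask For Price"]
    .str.replace(",", "", regex=False)
    .astype(int)
    .reset_index(drop=True)
)

print(clean.tolist())
```

Passing `regex=False` makes it explicit that the comma is a literal character rather than a pattern.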

In [24]:
# The "kms_driven" column holds strings that end in "kms". First, check for missing values:
print(df2[df2["kms_driven"].isna()])

Empty DataFrame
Columns: [name, year, Price, kms_driven, fuel_type, company]
Index: []


In [25]:
# As we can see, there are no NaN values, but two rows contain 'Petrol' instead of a distance:
df2["kms_driven"].isnull().sum()

0

In [27]:
# This also confirms that there are no missing (NaN) values:
df2[df2["kms_driven"].isna()]

Unnamed: 0,name,year,Price,kms_driven,fuel_type,company


In [28]:
# Keep only the first token of each string (get(0)), then remove the commas:
df2["kms_driven"] = df2["kms_driven"].str.split(" ").str.get(0).str.replace(',', "")

In [29]:
# Comparing df2 with its previous state to see the cleaned data:
df2.head()


Unnamed: 0,name,year,Price,kms_driven,fuel_type,company
0,Hyundai Santro Xing XO eRLX Euro III,2007,80000,45000,Petrol,Hyundai
1,Mahindra Jeep CL550 MDI,2006,425000,40,Diesel,Mahindra
2,Hyundai Grand i10 Magna 1.2 Kappa VTVT,2014,325000,28000,Petrol,Hyundai
3,Ford EcoSport Titanium 1.5L TDCi,2014,575000,36000,Diesel,Ford
4,Ford Figo,2012,175000,41000,Diesel,Ford


In [30]:
type(df2['kms_driven'].iloc[0])

str

In [31]:
# Before converting to a number, keep only the rows whose kms_driven is all digits (this drops the 'Petrol' rows):
df2 = df2[df2["kms_driven"].str.isnumeric()]

In [None]:
pd.options.mode.chained_assignment = None  # disabling the warning

In [32]:
df2["kms_driven"] = df2["kms_driven"].astype(int) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2["kms_driven"] = df2["kms_driven"].astype(int)
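
The message above is pandas' `SettingWithCopyWarning`, raised because `df2` was produced by a boolean filter and then assigned to. Instead of silencing the warning, one common option is to call `.copy()` on the filtered result. The sketch below illustrates this on a hypothetical toy frame (`toy`, `numeric_only`); it is a suggestion, not the approach taken in the original notebook.

```python
import pandas as pd

# Hypothetical sample in the same format as kms_driven, including a stray 'Petrol'.
toy = pd.DataFrame({
    "kms_driven": ["45,000 kms", "40 kms", "Petrol", "1,32,000 kms"],
})

# First token, without commas: "45,000 kms" -> "45000".
toy["kms_driven"] = (
    toy["kms_driven"].str.split(" ").str.get(0).str.replace(",", "", regex=False)
)

# .copy() makes the filtered result an independent DataFrame, so the
# assignment below does not operate on a slice of `toy`.
numeric_only = toy[toy["kms_driven"].str.isnumeric()].copy()
numeric_only["kms_driven"] = numeric_only["kms_driven"].astype(int)

print(numeric_only["kms_driven"].tolist())
```

With the explicit copy, the global `chained_assignment` option can stay at its default, so the warning remains available if a genuine chained assignment appears elsewhere.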
