# <font color='White'>NumPy Mini Project</font>

<font color='white'>The goal of this project is to clean and pre-process data using only NumPy, working with an Indian dataset that records the number of visits to several local monuments.</font>

### Dataset description
<font color='white'>It shows the number of visitors who visited the central monuments between 2019 and 2021.</font>

In [5]:
# Data dictionary
"""Circle - City
Name of the Monument
Domestic-2019-20: Number of people from India who visited the monument in 2019-20.
Foreign-2019-20: Number of foreigners who visited the monument in 2019-20.
Domestic-2020-21: Number of people from India who visited the monument in 2020-21.
Foreign-2020-21: Number of foreigners who visited the monument in 2020-21.
% Growth 2021-21/2019-20-Domestic: Growth percentage of domestic visitors from 2019-20 to 2020-21.
% Growth 2021-21/2019-20-Foreign: Growth percentage of foreign visitors from 2019-20 to 2020-21."""

'Circle - City\nName of the Monument\nDomestic-2019-20: Number of people from India who visited the monument in 2019-20.\nForeign-2019-20: Number of foreigners who visited the monument in 2019-20.\nDomestic-2020-21: Number of people from India who visited the monument in 2020-21.\nForeign-2020-21: Number of foreigners who visited the monument in 2020-21.\n% Growth 2021-21/2019-20-Domestic: Growth percentage of domestic visitors from 2019-20 to 2020-21.\n% Growth 2021-21/2019-20-Foreign: Growth percentage of foreign visitors from 2019-20 to 2020-21.'

### Importing the libraries

In [6]:
# Import
import numpy as np

# NumPy print settings
np.set_printoptions(suppress = True, linewidth = 200, precision = 2)

### Loading the Indian Tourism dataset

In [7]:
# Load the whole file with the default float dtype; text columns come back as nan
df01 = np.genfromtxt("India-Tourism-2021.csv",
                            delimiter = ';',
                            skip_header = 1,
                            autostrip = True,
                            encoding = 'cp1252')
df01.view()                        

array([[        nan,         nan,  4429710.  , ...,      -71.56,      -98.6 ,         nan],
       [        nan,         nan,  1627154.  , ...,      -77.18,      -99.27,         nan],
       [        nan,         nan,   454376.  , ...,      -76.27,      -99.69,         nan],
       ...,
       [        nan,         nan,    22353.  , ...,      -54.28,      -88.24,         nan],
       [        nan,         nan,   262949.  , ...,      -55.84,      -86.41,         nan],
       [        nan,         nan, 43607075.  , ...,      -69.84,      -84.91,         nan]])

In [8]:
df01.shape

(178, 9)

### Handling missing values

In [9]:
# Note how several columns contain only nan.
# This is due to special characters in the dataset and to the way NumPy loads numeric and string data: every field is parsed as a float, so text becomes nan. We will fix this.
np.isnan(df01).sum()

537
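The count above mixes two kinds of nan. A minimal sketch, using a hypothetical two-row sample rather than the real file, shows how text fields and genuinely empty numeric fields both end up as nan when `np.genfromtxt` uses its default float dtype. In the notebook, the three all-text columns account for 3 x 178 = 534 of the 537 nan values, which leaves 3 missing numeric entries.

```python
import io
import numpy as np

# Hypothetical sample in the same ';'-separated layout (not the real file).
sample = "Circle;Visitors;assessment\nAgra;100;good\nDelhi;;bad\n"

# With the default float dtype, text fields cannot be parsed and become nan;
# the empty numeric field is a truly missing value and also becomes nan.
arr = np.genfromtxt(io.StringIO(sample), delimiter=";", skip_header=1)

print(arr)
print(np.isnan(arr).sum())  # 5: two text columns x two rows, plus one empty field
```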

In [10]:
# Get the largest value + 1, ignoring nan values
# This arbitrary value will fill the missing entries when the numeric variables are loaded,
# and can later be treated as the missing-value marker
valor_nulo = np.nanmax(df01) + 1
print(valor_nulo)

43607076.0


In [11]:
# Compute the per-column mean, ignoring nan values
# Columns whose mean is nan contain no numbers at all, which separates the string columns from the numeric ones
media_ignorando_nan = np.nanmean(df01, axis = 0)
print(media_ignorando_nan)

[      nan       nan 734950.7   46458.89 221681.06   7008.86    -28.23    937.42       nan]




In [12]:
# Indices of the string columns (the ones whose mean is nan)
colunas_strings = np.argwhere(np.isnan(media_ignorando_nan)).squeeze()
colunas_strings

array([0, 1, 8], dtype=int64)

In [13]:
# Indices of the numeric columns
colunas_numericas = np.argwhere(~np.isnan(media_ignorando_nan)).squeeze()
colunas_numericas

array([2, 3, 4, 5, 6, 7], dtype=int64)
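The same split can be obtained without computing a mean. The sketch below is an alternative, not what the notebook does: it flags a column as text when every entry is nan, which also avoids the "Mean of empty slice" warning that `np.nanmean` emits for all-nan columns. The `toy` array and the names `text_cols` and `numeric_cols` are hypothetical.

```python
import numpy as np

# Hypothetical toy array: columns 0 and 3 are all nan (text), column 1 has one gap.
toy = np.array([[np.nan, 1.0, 10.0, np.nan],
                [np.nan, np.nan, 20.0, np.nan]])

# A column is treated as text when every one of its entries is nan.
all_nan = np.isnan(toy).all(axis=0)

text_cols = np.flatnonzero(all_nan)
numeric_cols = np.flatnonzero(~all_nan)
print(text_cols, numeric_cols)  # [0 3] [1 2]
```

`np.flatnonzero` returns a 1-D index array directly, so no `squeeze()` is needed.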

### Loading the Indian Tourism dataset - strings only

In [14]:
# Reload the data, this time only the string columns
strings = np.genfromtxt("India-Tourism-2021.csv",
                            delimiter = ';',
                            skip_header = 1,
                            autostrip = True, 
                            usecols = colunas_strings,
                            dtype = str, 
                            encoding = 'cp1252')
strings

array([['Agra', 'Taj Mahal', 'excellent'],
       ['Agra', 'Agra Fort', 'good'],
       ['Agra', 'Fatehpur Sikri', 'bad'],
       ['Agra', 'Akbar Tomb Sikandra', 'bad'],
       ['Agra', 'Mariam tomb Sikandra', 'bad'],
       ['Agra', 'Itimad-ud-Daulah-Tomb', 'excellent'],
       ['Agra', 'Ram Bagh', 'bad'],
       ['Agra', 'Mehtab Bagh', 'bad'],
       ['Agra', 'Mausoleum', 'bad'],
       ['Total', 'Total', 'excellent'],
       ['Lucknow', 'Site of Sahet mahet', 'bad'],
       ['Lucknow', 'Residency Building', 'bad'],
       ['Lucknow', 'Piprahwa & Ganwaria', 'excellent'],
       ['Total', 'Total', 'bad'],
       ['Jhansi', 'Gupta Temple & Varah Temple  Deogarh', 'bad'],
       ['Jhansi', 'Kalinjar Fort', 'good'],
       ['Jhansi', 'Rani Lakshmi Bai Mahal', 'good'],
       ['Jhansi', 'Rani Jhansi Fort', 'bad'],
       ['Total', 'Total', 'excellent'],
       ['Sarnath', 'Lord Cornwallis Tomb', 'excellent'],
       ['Sarnath', 'Old Fort (Shahi Fort)  Jaunpur', 'bad'],
       ['Sarnath', 

### Loading the Indian Tourism dataset - numeric only

In [15]:
# Load the numeric columns, filling missing values with valor_nulo
numeric = np.genfromtxt("India-Tourism-2021.csv",
                            delimiter = ';',
                            autostrip = True,
                            skip_header = 1,
                            usecols = colunas_numericas,
                            filling_values = valor_nulo, 
                            encoding = 'cp1252')
numeric

array([[ 4429710.  ,   645415.  ,  1259892.  ,     9034.  ,      -71.56,      -98.6 ],
       [ 1627154.  ,   386522.  ,   371242.  ,     2810.  ,      -77.18,      -99.27],
       [  454376.  ,   184751.  ,   107835.  ,      574.  ,      -76.27,      -99.69],
       ...,
       [   22353.  ,       85.  ,    10219.  ,       10.  ,      -54.28,      -88.24],
       [  262949.  ,     1008.  ,   116120.  ,      137.  ,      -55.84,      -86.41],
       [43607075.  ,  2756561.  , 13153076.  ,   415859.  ,      -69.84,      -84.91]])
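Once the gaps carry the sentinel value, they still have to be treated as missing at some point. The notebook does not show that step, so the following is only a sketch of one possible treatment, assuming column-mean imputation is acceptable. The value `99.0` stands in for `valor_nulo`, and `toy`, `masked` and `filled` are hypothetical names.

```python
import numpy as np

# Hypothetical numeric block in which 99.0 plays the role of valor_nulo.
sentinel = 99.0
toy = np.array([[10.0, 1.0],
                [sentinel, 3.0],
                [30.0, sentinel]])

# Turn the sentinel back into nan, then fill each gap with its column mean.
masked = np.where(toy == sentinel, np.nan, toy)
col_means = np.nanmean(masked, axis=0)
filled = np.where(np.isnan(masked), col_means, masked)
print(filled)
```

Because the sentinel is larger than any real value, leaving it in place would distort every statistic computed from those columns.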

### Loading the Indian Tourism dataset - column names only

In [16]:
# Load the column names (skip_footer drops every data row, leaving only the header line)
arr_nomes_colunas = np.genfromtxt("India-Tourism-2021.csv",
                                  delimiter = ';',
                                  autostrip = True,
                                  skip_footer = df01.shape[0],
                                  dtype = str, 
                                  encoding = 'cp1252')
arr_nomes_colunas

array(['Circle', 'Name of the Monument', 'Domestic-2019-20', 'Foreign-2019-20', 'Domestic-2020-21', 'Foreign-2020-21', '% Growth 2021-21/2019-20-Domestic', '% Growth 2021-21/2019-20-Foreign',
       'assessment'], dtype='<U33')

In [17]:
# Split the header into string and numeric column names
header_strings, header_numeric = arr_nomes_colunas[colunas_strings], arr_nomes_colunas[colunas_numericas]
header_strings, header_numeric

(array(['Circle', 'Name of the Monument', 'assessment'], dtype='<U33'),
 array(['Domestic-2019-20', 'Foreign-2019-20', 'Domestic-2020-21', 'Foreign-2020-21', '% Growth 2021-21/2019-20-Domestic', '% Growth 2021-21/2019-20-Foreign'], dtype='<U33'))
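Reading the header by skipping every data row works, but it depends on already knowing the row count. As an alternative sketch, `np.genfromtxt` also accepts `max_rows`, which reads only the first line. The two-line `sample` below is a hypothetical stand-in for the file.

```python
import io
import numpy as np

# Hypothetical two-line sample with the same ';' delimiter as the file.
sample = "Circle;Name of the Monument;Domestic-2019-20\nAgra;Taj Mahal;4429710\n"

# max_rows=1 reads just the header line, so the row count is not needed.
header = np.genfromtxt(io.StringIO(sample),
                       delimiter=";",
                       autostrip=True,
                       dtype=str,
                       max_rows=1)
print(header)
```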

### Renaming the columns

In [18]:
# Rename the string columns to make them easier to identify
header_strings[0] = "Cidade"
header_strings[1] = "Monumento"
header_strings[2] = "Avaliacao"
header_strings

array(['Cidade', 'Monumento', 'Avaliacao'], dtype='<U33')

In [19]:
# Rename the numeric columns to make them easier to identify
header_numeric[0] = "Domestico-2019-20"
header_numeric[1] = "Estrangeiro-2019-20"
header_numeric[2] = "Domestico-2020-21"
header_numeric[3] = "Estrangeiro-2020-21"
header_numeric[4] = "Crescimento-Domestico-21/19"
header_numeric[5] = "Crescimento-Estrangeiro-21/19"
header_numeric

array(['Domestico-2019-20', 'Estrangeiro-2019-20', 'Domestico-2020-21', 'Estrangeiro-2020-21', 'Crescimento-Domestico-21/19', 'Crescimento-Estrangeiro-21/19'], dtype='<U33')

### Pre-processing the city variable with label encoding

In [20]:
# Extract the unique values of the variable
np.unique(strings[:,0])

array(['Agra', 'Amaravati', 'Aurangabad', 'Banglore', 'Bhopal', 'Bhubaneswar', 'Chandigarh', 'Chennai', 'Delhi', 'Dharwad', 'Goa', 'Grand Total', 'Guwahati', 'Hampi', 'Hyderabad', 'Jabalpur',
       'Jaipur', 'Jhansi', 'Jodhpur', 'Kolkata', 'Leh', 'Lucknow', 'Mumbai', 'Nagpur', 'Patna', 'Raiganj', 'Raipur', 'Rajkot', 'Sarnath', 'Shimla', 'Srinagar', 'Thrissur', 'Tiruchirappalli', 'Total',
       'Vadodara'], dtype='<U110')

In [21]:
# Show the number of unique values
cidades_numero = np.unique(strings[:,0])
cidades_numero.shape

(35,)

In [22]:
# Array with the city names, in the same order returned by np.unique
cidades = np.array(['Agra', 'Amaravati', 'Aurangabad', 'Banglore', 'Bhopal', 'Bhubaneswar', 'Chandigarh', 'Chennai', 'Delhi', 'Dharwad', 'Goa', 'Grand Total', 'Guwahati', 'Hampi', 'Hyderabad', 'Jabalpur',
       'Jaipur', 'Jhansi', 'Jodhpur', 'Kolkata', 'Leh', 'Lucknow', 'Mumbai', 'Nagpur', 'Patna', 'Raiganj', 'Raipur', 'Rajkot', 'Sarnath', 'Shimla', 'Srinagar', 'Thrissur', 'Tiruchirappalli', 'Total',
       'Vadodara'])

In [23]:
# Loop that converts the city names into numeric values
# This is called label encoding
for i in range(len(cidades)):
    strings[:,0] = np.where(strings[:,0] == cidades[i], i, strings[:,0])
strings[:,0]

array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '33', '21', '21', '21', '33', '17', '17', '17', '17', '33', '28', '28', '28', '28', '28', '28', '33', '31', '31', '31', '31', '33', '7', '7', '7',
       '33', '32', '32', '32', '32', '32', '32', '32', '33', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '4', '33', '15', '15', '15', '33', '9', '9', '9', '9', '9', '9', '33', '13', '13', '33',
       '3', '3', '3', '3', '33', '25', '25', '33', '19', '19', '33', '27', '27', '27', '33', '34', '34', '34', '34', '33', '5', '5', '5', '5', '5', '33', '2', '2', '2', '2', '2', '2', '33', '22',
       '22', '22', '22', '22', '22', '22', '22', '22', '22', '22', '22', '22', '33', '23', '23', '33', '6', '6', '33', '8', '8', '8', '8', '8', '8', '8', '8', '8', '8', '8', '33', '12', '12', '12',
       '12', '12', '33', '10', '33', '14', '14', '14', '33', '16', '16', '16', '33', '18', '18', '18', '33', '20', '33', '24', '24', '24', '24', '24', '33', '26', '33', '29', '29', '33', '30', '30',
       '
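The loop assigns each city the position it has in the sorted list of unique names. As a sketch of a more compact route to the same codes, `np.unique` can return those positions directly through `return_inverse=True`. The short `cities` array is hypothetical.

```python
import numpy as np

# Hypothetical city column with repeated values.
cities = np.array(["Agra", "Agra", "Total", "Lucknow", "Total", "Delhi"])

# labels holds the sorted unique names; codes holds, for each row,
# the index of its name inside labels.
labels, codes = np.unique(cities, return_inverse=True)
print(labels)  # ['Agra' 'Delhi' 'Lucknow' 'Total']
print(codes)   # [0 0 3 2 3 1]
```

The codes come back as integers, and `labels[codes]` recovers the original names, so the mapping stays reversible.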

In [24]:
# Extract the unique values of the variable
np.unique(strings[:,1])

array(['Aga  Khan Palace Building', 'Agra Fort', "Ahom Raja's Palace", 'Ajanta Caves', 'Akbar Tomb Sikandra', 'Amaravati mahastupa',
       'Ancient Buddhist Remains comprising monastery stupa  rock sculptures  inscriptions ect Mansur', 'Ancient Buddhist Site known as Chaukhandi stupa', 'Ancient Palace Leh',
       'Ancient Remains on both Udaigiri & Khandagiri Hills', 'Ancient Site Bhangarh', 'Ancient Site and Adamgrah rock shelter  Kalamdi Rasuliya and kishanpur',
       'Ancient Site of Vikramshila Antichak', 'Ancient site of Vaishali  Kolhua', 'Asokan Rock Edict  Jungadh', 'Aurangabad Caves', 'Avantiswamin Temple  Avantipur  District Pulwama',
       'Baba Pyara Caves Junagadh & Khapra Khodiya Caves  Junagadh', 'Badal Mahal Gatwway  Chanderi', 'Baori at Abhaneri', 'Bekal Fort  Pallikkare  Distt. Kasargod', 'Bellary Fort',
       'Bir Singh Palace Datia', 'Bishnudol', 'Bishnupur Temples', 'Buddhish Caves  Junagadh', 'Buddhist Caves', 'Buddhist Caves Kanheri', 'Buddhist Monuments  Sa

In [25]:
# Show the number of unique values
monumento_numero = np.unique(strings[:,1])
monumento_numero.shape

(146,)

In [26]:
strings[:,1].shape

(178,)

In [27]:
strings.shape

(178, 3)

In [28]:
# Some values in the monument column appear more than once (146 unique values in 178 rows),
# so each one will be mapped to a number from 0 to 145
# First, collect the indices of the per-city "Total" rows so they can be excluded
# Only the "Total" rows are matched here; the single "Grand Total" row is removed separately below
lista = []
for i in range(strings.shape[0]):
    if strings[i, 1] == "Total":
        lista.append(i)



In [29]:
print(lista)
strings_sem_total = np.delete(strings, lista, axis=0)

[9, 13, 18, 25, 30, 34, 42, 54, 58, 65, 68, 73, 76, 79, 83, 88, 94, 101, 115, 118, 121, 133, 139, 141, 145, 149, 153, 155, 161, 163, 166, 170, 176]


In [30]:
strings_sem_total.shape


(145, 3)

In [31]:
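# Drop the last remaining row (index 144), the "Grand Total" row that the "Total" filter did not catch
strings_sem_total = np.delete(strings_sem_total, 144, axis=0)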
strings_sem_total

array([['0', 'Taj Mahal', 'excellent'],
       ['0', 'Agra Fort', 'good'],
       ['0', 'Fatehpur Sikri', 'bad'],
       ['0', 'Akbar Tomb Sikandra', 'bad'],
       ['0', 'Mariam tomb Sikandra', 'bad'],
       ['0', 'Itimad-ud-Daulah-Tomb', 'excellent'],
       ['0', 'Ram Bagh', 'bad'],
       ['0', 'Mehtab Bagh', 'bad'],
       ['0', 'Mausoleum', 'bad'],
       ['21', 'Site of Sahet mahet', 'bad'],
       ['21', 'Residency Building', 'bad'],
       ['21', 'Piprahwa & Ganwaria', 'excellent'],
       ['17', 'Gupta Temple & Varah Temple  Deogarh', 'bad'],
       ['17', 'Kalinjar Fort', 'good'],
       ['17', 'Rani Lakshmi Bai Mahal', 'good'],
       ['17', 'Rani Jhansi Fort', 'bad'],
       ['28', 'Lord Cornwallis Tomb', 'excellent'],
       ['28', 'Old Fort (Shahi Fort)  Jaunpur', 'bad'],
       ['28', 'Observatory of Man Singh', 'excellent'],
       ['28', 'Excavated Remains at sarnath', 'good'],
       ['28', 'Tomb of Lal Khan', 'excellent'],
       ['28', 'Ancient Buddhist Site known
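A comparison written as `value == ("Total" or "Grand Total")` only ever tests against "Total", because `or` returns its first truthy operand. That is why the "Grand Total" row needed a separate deletion. A sketch of removing both kinds of rows in one step with a boolean mask follows. The `rows` array is hypothetical.

```python
import numpy as np

# Hypothetical rows: two monuments, one per-city total and the grand total.
rows = np.array([["Agra", "Taj Mahal"],
                 ["Total", "Total"],
                 ["Delhi", "Red Fort"],
                 ["Grand Total", "Grand Total"]])

# A non-empty string is truthy, so this expression is simply "Total".
print("Total" or "Grand Total")  # Total

# np.isin tests membership against both labels at once.
is_total = np.isin(rows[:, 1], ["Total", "Grand Total"])
kept = rows[~is_total]
print(kept)
```

Applying the same mask to the numeric array would keep both arrays aligned row by row.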

In [32]:
# Extract the unique values of the variable
np.unique(strings[:,1])

array(['Aga  Khan Palace Building', 'Agra Fort', "Ahom Raja's Palace", 'Ajanta Caves', 'Akbar Tomb Sikandra', 'Amaravati mahastupa',
       'Ancient Buddhist Remains comprising monastery stupa  rock sculptures  inscriptions ect Mansur', 'Ancient Buddhist Site known as Chaukhandi stupa', 'Ancient Palace Leh',
       'Ancient Remains on both Udaigiri & Khandagiri Hills', 'Ancient Site Bhangarh', 'Ancient Site and Adamgrah rock shelter  Kalamdi Rasuliya and kishanpur',
       'Ancient Site of Vikramshila Antichak', 'Ancient site of Vaishali  Kolhua', 'Asokan Rock Edict  Jungadh', 'Aurangabad Caves', 'Avantiswamin Temple  Avantipur  District Pulwama',
       'Baba Pyara Caves Junagadh & Khapra Khodiya Caves  Junagadh', 'Badal Mahal Gatwway  Chanderi', 'Baori at Abhaneri', 'Bekal Fort  Pallikkare  Distt. Kasargod', 'Bellary Fort',
       'Bir Singh Palace Datia', 'Bishnudol', 'Bishnupur Temples', 'Buddhish Caves  Junagadh', 'Buddhist Caves', 'Buddhist Caves Kanheri', 'Buddhist Monuments  Sa

In [33]:
monumento  = np.array(['Aga  Khan Palace Building', 'Agra Fort', "Ahom Raja's Palace", 'Ajanta Caves', 'Akbar Tomb Sikandra', 'Amaravati mahastupa',
       'Ancient Buddhist Remains comprising monastery stupa  rock sculptures  inscriptions ect Mansur', 'Ancient Buddhist Site known as Chaukhandi stupa', 'Ancient Palace Leh',
       'Ancient Remains on both Udaigiri & Khandagiri Hills', 'Ancient Site Bhangarh', 'Ancient Site and Adamgrah rock shelter  Kalamdi Rasuliya and kishanpur',
       'Ancient Site of Vikramshila Antichak', 'Ancient site of Vaishali  Kolhua', 'Asokan Rock Edict  Jungadh', 'Aurangabad Caves', 'Avantiswamin Temple  Avantipur  District Pulwama',
       'Baba Pyara Caves Junagadh & Khapra Khodiya Caves  Junagadh', 'Badal Mahal Gatwway  Chanderi', 'Baori at Abhaneri', 'Bekal Fort  Pallikkare  Distt. Kasargod', 'Bellary Fort',
       'Bir Singh Palace Datia', 'Bishnudol', 'Bishnupur Temples', 'Buddhish Caves  Junagadh', 'Buddhist Caves', 'Buddhist Caves Kanheri', 'Buddhist Monuments  Sanchi',
       'Buddhist Remains on hill top at Guntupalli  W.G.District', 'Buddhist cave no 01 to 51 Dhamnar  Tehsil Garoth', 'Cave  Temple and Inscriptions  Junaar  Lenyadri',
       'Cave Temple & Inscriptions  Bhaja', 'Caves  Temples and inscriptions Karla', 'Caves 1 t0 20 Udaygiri Vidisha', 'Champaner Monuments  Pavagadh', 'Chandragiri Monument', 'Charminar',
       'Chittaurgarh Fort', 'Cooch Bihar Palace', 'Dariya Daulath Bagh', 'Daulatabad Fort', 'Deeg Bhawan', 'Durga temple complex Aihole', 'Elephanta Caves', 'Ellora Caves',
       'Excavated Remains at Nalanda', 'Excavated Remains at sarnath', 'Fatehpur Sikri', 'Fort  Palakkad  Palakkad', 'Fort Museum  Thirumayam', 'Fort St. Angelo  Kannur', 'Fort Vattakottai',
       'Fort on Rock  Dindigual', 'Fortress and Temple Chitrudurga Fort', 'Gawilgarh Fort', 'Gingee Fort   Gingee', 'Gol Gumbaz  Vijayapura', 'Golconda', 'Grand Total', 'Group of Monuments  Hampi',
       'Group of Monuments (WH) Pattadakal', 'Group of Monuments Mamallapuram', 'Group of Temple Parameshvar shiv and Karan Temple  Amarkantak', 'Group of Temples at kiramchi  District Udhampur',
       'Group of four MaidansCharaideo  Sibasagar', 'Group of monument  Royal Palace Mandu', 'Gupta Temple & Varah Temple  Deogarh', 'Gwalior Fort', 'Hauzkhas', 'Hazarduari Palace',
       "Hoshang Shah's Tomb", 'Humayun Tomb', 'Ibrahim Rauza  Bijapur', 'Itimad-ud-Daulah-Tomb', 'Jaina & Vaishnava Cave  Badami', 'Janjira Fort  Murd', 'Jantar Mantar', 'Kalinjar Fort',
       'Kareghar of Ahom Kings  Sibasagar', 'Keshava Temple', 'Khan-I-Khana', 'Kolaba Fort  Alibag', 'Kondiote Caves', 'Kotla Feroz Shah', 'Kumbhalgarh Fort', 'Lohgad Fort', 'Lord Cornwallis Tomb',
       'Marble Pavillion and balustrade on the Ana Sagar bund and ruins of the marble Hammam Behind the Ana sagar Bund', 'Mariam tomb Sikandra', 'Mausoleum', 'Mehtab Bagh', 'Metcelf-Hall',
       'Moovarkoil  Kodumbalur', 'Nagarjuna Kunda', 'Natural Caven with inscription eladipattam  Sittannavasal', 'Observatory of Man Singh', 'Old Fort  Sholapur', 'Old Fort (Shahi Fort)  Jaunpur',
       'Palace Complex at Ramnagar  Distt. Udhampur', 'Palace of Tipu Sultan', 'Pandulena Caves', 'Piprahwa & Ganwaria', 'Purana Qila', 'Qutub Minar', 'Raigad Fort', 'Rajarani Temple', 'Ram Bagh',
       'Ranghar Pavillion  Jaisagar', 'Rani Jhansi Fort', 'Rani Ki-Vav  Patan', 'Rani Lakshmi Bai Mahal', 'Red Fort', 'Remains of Patliputra Site of Mauryan Palace  Kumrahar', 'Residency Building',
       'Rock-cut Jain Temple  Sittannavasasl', 'Rock-cut Temples and Sculptures', "Roopmati's Pavilion", 'Rudabai Step Well  Adalaj', 'Ruined Fort  kangra',
       'Ruins of Buddhist Temples and Images lalitgiri', 'Safdarjung Tomb', 'Shaniwarwada', "Sheikh Chilli's Tomb", "Sher Shah's Tomb", 'Site of Sahet mahet', 'Sultanghari Tomb',
       'Sun Temple  Konark', 'Sun temple  Modhera', 'Suraj Kund', 'Taj Mahal', 'Temple of Laxman and Old sites including sculptures sirpur', 'Temples & Sculpture Shed  lakkumdi',
       'The Hill Containing Many Valuable Sculptures and Images Ratnagiri', 'The palace situated in the fort  Burhanpur', 'Tiger headed Rock cut temple & two other monuments  Saluvankuppam',
       "Tirumalai Nayak's Palace Srivilliputhur", 'Tomb of Lal Khan', 'Tomb of Rabia Durani (Bibi ka Maqbara)', 'Total', 'Tughluqabad', 'Undavalli caves', 'Upper Fort Aguada', 'Warangal',
       'Western Group of Temples  Khajuraho', 'mattancherry Palace Museum Kochi'])

In [34]:
monumento

array(['Aga  Khan Palace Building', 'Agra Fort', "Ahom Raja's Palace", 'Ajanta Caves', 'Akbar Tomb Sikandra', 'Amaravati mahastupa',
       'Ancient Buddhist Remains comprising monastery stupa  rock sculptures  inscriptions ect Mansur', 'Ancient Buddhist Site known as Chaukhandi stupa', 'Ancient Palace Leh',
       'Ancient Remains on both Udaigiri & Khandagiri Hills', 'Ancient Site Bhangarh', 'Ancient Site and Adamgrah rock shelter  Kalamdi Rasuliya and kishanpur',
       'Ancient Site of Vikramshila Antichak', 'Ancient site of Vaishali  Kolhua', 'Asokan Rock Edict  Jungadh', 'Aurangabad Caves', 'Avantiswamin Temple  Avantipur  District Pulwama',
       'Baba Pyara Caves Junagadh & Khapra Khodiya Caves  Junagadh', 'Badal Mahal Gatwway  Chanderi', 'Baori at Abhaneri', 'Bekal Fort  Pallikkare  Distt. Kasargod', 'Bellary Fort',
       'Bir Singh Palace Datia', 'Bishnudol', 'Bishnupur Temples', 'Buddhish Caves  Junagadh', 'Buddhist Caves', 'Buddhist Caves Kanheri', 'Buddhist Monuments  Sa

In [35]:
monumento.shape

(146,)

In [36]:
# Loop that converts the monument names into numeric values
# This is called label encoding
for i in range(len(monumento)):
    strings[:,1] = np.where(strings[:,1] == monumento[i], i, strings[:,1])
strings[:,1]

array(['130', '1', '48', '4', '89', '74', '107', '91', '90', '139', '125', '114', '102', '139', '67', '78', '111', '109', '139', '87', '98', '96', '47', '137', '7', '139', '20', '145', '49', '51',
       '139', '62', '56', '135', '139', '53', '93', '115', '136', '95', '50', '52', '139', '26', '134', '71', '66', '117', '28', '68', '30', '18', '34', '22', '139', '11', '144', '63', '139', '43',
       '75', '61', '57', '73', '132', '139', '60', '21', '139', '40', '80', '100', '54', '139', '39', '70', '139', '24', '92', '139', '14', '25', '17', '139', '35', '128', '110', '118', '139', '127',
       '9', '106', '133', '120', '139', '3', '45', '138', '41', '101', '15', '139', '44', '27', '122', '0', '31', '33', '32', '105', '82', '97', '76', '83', '86', '139', '55', '6', '139', '129',
       '123', '139', '81', '126', '140', '112', '72', '104', '69', '77', '121', '84', '103', '139', '2', '79', '108', '23', '65', '139', '142', '139', '37', '58', '143', '139', '10', '19', '42',
       '139', '

In [37]:
strings

array([['0', '130', 'excellent'],
       ['0', '1', 'good'],
       ['0', '48', 'bad'],
       ['0', '4', 'bad'],
       ['0', '89', 'bad'],
       ['0', '74', 'excellent'],
       ['0', '107', 'bad'],
       ['0', '91', 'bad'],
       ['0', '90', 'bad'],
       ['33', '139', 'excellent'],
       ['21', '125', 'bad'],
       ['21', '114', 'bad'],
       ['21', '102', 'excellent'],
       ['33', '139', 'bad'],
       ['17', '67', 'bad'],
       ['17', '78', 'good'],
       ['17', '111', 'good'],
       ['17', '109', 'bad'],
       ['33', '139', 'excellent'],
       ['28', '87', 'excellent'],
       ['28', '98', 'bad'],
       ['28', '96', 'excellent'],
       ['28', '47', 'good'],
       ['28', '137', 'excellent'],
       ['28', '7', 'excellent'],
       ['33', '139', 'excellent'],
       ['31', '20', 'excellent'],
       ['31', '145', 'bad'],
       ['31', '49', 'bad'],
       ['31', '51', 'good'],
       ['33', '139', 'excellent'],
       ['7', '62', 'bad'],
       ['7', '56', 'bad'],

### Pre-processing the rating variable with binarization

In [38]:
strings[:,2]  

array(['excellent', 'good', 'bad', 'bad', 'bad', 'excellent', 'bad', 'bad', 'bad', 'excellent', 'bad', 'bad', 'excellent', 'bad', 'bad', 'good', 'good', 'bad', 'excellent', 'excellent', 'bad',
       'excellent', 'good', 'excellent', 'excellent', 'excellent', 'excellent', 'bad', 'bad', 'good', 'excellent', 'bad', 'bad', 'good', 'excellent', 'excellent', 'good', 'bad', 'bad', 'excellent',
       'bad', 'good', 'good', 'good', 'excellent', 'good', 'good', 'excellent', 'excellent', 'good', 'bad', 'bad', 'excellent', 'bad', 'excellent', 'bad', 'bad', 'bad', 'good', 'bad', 'bad', 'good',
       'good', 'excellent', 'bad', 'excellent', 'bad', 'bad', 'excellent', 'excellent', 'excellent', 'good', 'good', 'excellent', 'bad', 'good', 'good', 'excellent', 'bad', 'excellent', 'good',
       'bad', 'excellent', 'good', 'bad', 'good', 'excellent', 'bad', 'excellent', 'good', 'excellent', 'excellent', 'bad', 'bad', 'bad', 'bad', 'bad', 'good', 'bad', 'good', 'excellent', 'good',
       'good', 'bad'

In [39]:
# Number of unique values
np.unique(strings[:,2]).size

3

In [40]:
# Array with the ratings that count as good
status_good = np.array(['', 'excellent', 'good'])

In [41]:
# Compare each value of the variable with the array above and convert it to a binary value
# This is called binarization
strings[:,2] = np.where(np.isin(strings[:,2], status_good),1,0)

In [42]:
# Extract the unique values of the variable
np.unique(strings[:,2])

array(['0', '1'], dtype='<U110')
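The result above is stored as the strings '0' and '1' because it is written back into a string array. A small sketch on hypothetical data shows the same binarization kept as integers by placing the result in a separate array.

```python
import numpy as np

# Hypothetical rating column.
ratings = np.array(["excellent", "good", "bad", "bad", "good"])

# np.isin yields a boolean mask; astype(int) turns True/False into 1/0.
positive = np.isin(ratings, ["excellent", "good"]).astype(int)
print(positive)  # [1 1 0 0 1]
```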

In [43]:
strings

array([['0', '130', '1'],
       ['0', '1', '1'],
       ['0', '48', '0'],
       ['0', '4', '0'],
       ['0', '89', '0'],
       ['0', '74', '1'],
       ['0', '107', '0'],
       ['0', '91', '0'],
       ['0', '90', '0'],
       ['33', '139', '1'],
       ['21', '125', '0'],
       ['21', '114', '0'],
       ['21', '102', '1'],
       ['33', '139', '0'],
       ['17', '67', '0'],
       ['17', '78', '1'],
       ['17', '111', '1'],
       ['17', '109', '0'],
       ['33', '139', '1'],
       ['28', '87', '1'],
       ['28', '98', '0'],
       ['28', '96', '1'],
       ['28', '47', '1'],
       ['28', '137', '1'],
       ['28', '7', '1'],
       ['33', '139', '1'],
       ['31', '20', '1'],
       ['31', '145', '0'],
       ['31', '49', '0'],
       ['31', '51', '1'],
       ['33', '139', '1'],
       ['7', '62', '0'],
       ['7', '56', '0'],
       ['7', '135', '1'],
       ['33', '139', '1'],
       ['32', '53', '1'],
       ['32', '93', '1'],
       ['32', '115', '0'],
       ['3

## Building the final dataset

In [44]:
numeric.shape

(178, 6)

In [45]:
strings.shape

(178, 3)

In [46]:
# Concatenate the data arrays
dados_final = np.hstack((numeric, strings))
dados_final

array([['4429710.0', '645415.0', '1259892.0', ..., '0', '130', '1'],
       ['1627154.0', '386522.0', '371242.0', ..., '0', '1', '1'],
       ['454376.0', '184751.0', '107835.0', ..., '0', '48', '0'],
       ...,
       ['22353.0', '85.0', '10219.0', ..., '1', '29', '1'],
       ['262949.0', '1008.0', '116120.0', ..., '33', '139', '0'],
       ['43607075.0', '2756561.0', '13153076.0', ..., '11', '59', '0']], dtype='<U110')
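Stacking a float array next to a string array produces an array of strings, which is why every number above is quoted. If a numeric result is preferred, one option, sketched here on hypothetical two-row arrays, is to cast the encoded string columns to float before stacking.

```python
import numpy as np

# Hypothetical slices of the numeric and the already-encoded string arrays.
numeric = np.array([[4429710.0, -71.56],
                    [1627154.0, -77.18]])
encoded = np.array([["0", "130", "1"],
                    ["0", "1", "1"]])

# Mixing float and string arrays promotes everything to a string dtype.
mixed = np.hstack((numeric, encoded))
print(mixed.dtype.kind)  # U

# Casting the encoded columns first keeps the result numeric.
all_float = np.hstack((numeric, encoded.astype(float)))
print(all_float.dtype)  # float64
```

This only works after every string column has been encoded, since `astype(float)` fails on text such as a monument name.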

In [47]:
columns_index_order = [6,7,0,1,2,3,4,5,8]

In [48]:
dados_final = dados_final[:,columns_index_order]
dados_final

array([['0', '130', '4429710.0', ..., '-71.56', '-98.6', '1'],
       ['0', '1', '1627154.0', ..., '-77.18', '-99.27', '1'],
       ['0', '48', '454376.0', ..., '-76.27', '-99.69', '0'],
       ...,
       ['1', '29', '22353.0', ..., '-54.28', '-88.24', '1'],
       ['33', '139', '262949.0', ..., '-55.84', '-86.41', '0'],
       ['11', '59', '43607075.0', ..., '-69.84', '-84.91', '0']], dtype='<U110')

In [49]:
# Concatenate the header arrays
cabecalho_final = np.hstack((header_strings, header_numeric))
cabecalho_final

array(['Cidade', 'Monumento', 'Avaliacao', 'Domestico-2019-20', 'Estrangeiro-2019-20', 'Domestico-2020-21', 'Estrangeiro-2020-21', 'Crescimento-Domestico-21/19', 'Crescimento-Estrangeiro-21/19'],
      dtype='<U33')

In [50]:
columns_index_order_cabecalho = [0,1,3,4,5,6,7,8,2]

In [51]:
cabecalho_final = cabecalho_final[columns_index_order_cabecalho]
cabecalho_final

array(['Cidade', 'Monumento', 'Domestico-2019-20', 'Estrangeiro-2019-20', 'Domestico-2020-21', 'Estrangeiro-2020-21', 'Crescimento-Domestico-21/19', 'Crescimento-Estrangeiro-21/19', 'Avaliacao'],
      dtype='<U33')

In [53]:
# Stack the array of column names on top of the data array
df_final = np.vstack((cabecalho_final, dados_final))
df_final

array([['Cidade', 'Monumento', 'Domestico-2019-20', ..., 'Crescimento-Domestico-21/19', 'Crescimento-Estrangeiro-21/19', 'Avaliacao'],
       ['0', '130', '4429710.0', ..., '-71.56', '-98.6', '1'],
       ['0', '1', '1627154.0', ..., '-77.18', '-99.27', '1'],
       ...,
       ['1', '29', '22353.0', ..., '-54.28', '-88.24', '1'],
       ['33', '139', '262949.0', ..., '-55.84', '-86.41', '0'],
       ['11', '59', '43607075.0', ..., '-69.84', '-84.91', '0']], dtype='<U110')

## Saving the final dataset to disk

In [54]:
# Save to disk
np.savetxt("dataset_limpo_preprocessado_indiano.csv", 
           df_final, 
           fmt = '%s',
           delimiter = ',')
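`np.savetxt` can also write the column names itself through its `header` argument, so the data does not have to be stacked under a string header first. The sketch below assumes a fully numeric array. The file name `sketch.csv`, the temporary directory and the three-column `data` array are hypothetical.

```python
import os
import tempfile
import numpy as np

# Hypothetical numeric data and column names.
data = np.array([[0.0, 130.0, 1.0],
                 [0.0, 1.0, 1.0]])
names = ["Cidade", "Monumento", "Avaliacao"]

path = os.path.join(tempfile.mkdtemp(), "sketch.csv")

# comments="" stops savetxt from prefixing the header line with "# ".
np.savetxt(path, data, fmt="%g", delimiter=",",
           header=",".join(names), comments="")

# Read the file back to confirm the round trip.
loaded = np.genfromtxt(path, delimiter=",", skip_header=1)
print(loaded)
```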

## Conclusion

The goal of this project was to clean and pre-process data using only NumPy, so that the result can be used in a machine learning model.
It covered handling missing values and converting strings to numbers through label encoding and binarization.
I hope you enjoyed it.

Everton Crespi