# 06 - Combining datasets

### Step 1. Import the necessary libraries

In [8]:
import numpy as np
import pandas as pd

### Step 2. Import the datasets you'll find in the folder `data` and assign each to a variable called cars1 and cars2

The following exercise uses data from [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Auto+MPG)

In [9]:
cars1 = pd.read_csv('data/cars1.csv')
cars2 = pd.read_csv('data/cars2.csv')

### Step 3. Have a look at the columns of each dataset.

In [10]:
cars1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           198 non-null    float64
 1   cylinders     198 non-null    int64  
 2   displacement  198 non-null    int64  
 3   horsepower    198 non-null    object 
 4   weight        198 non-null    int64  
 5   acceleration  198 non-null    float64
 6   model         198 non-null    int64  
 7   origin        198 non-null    int64  
 8   car           198 non-null    object 
 9   Unnamed: 9    0 non-null      float64
 10  Unnamed: 10   0 non-null      float64
 11  Unnamed: 11   0 non-null      float64
 12  Unnamed: 12   0 non-null      float64
 13  Unnamed: 13   0 non-null      float64
dtypes: float64(7), int64(5), object(2)
memory usage: 21.8+ KB


In [11]:
cars2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           200 non-null    float64
 1   cylinders     200 non-null    int64  
 2   displacement  200 non-null    int64  
 3   horsepower    200 non-null    object 
 4   weight        200 non-null    int64  
 5   acceleration  200 non-null    float64
 6   model         200 non-null    int64  
 7   origin        200 non-null    int64  
 8   car           200 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 14.2+ KB
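
Both `info()` listings report `horsepower` as `object` rather than a numeric dtype, which usually means at least one entry in the column is not a number. Below is a minimal sketch of how such a column might be checked and converted. It uses a small made-up frame, and the `'?'` placeholder is an assumption for illustration, not something the output above confirms.

```python
import pandas as pd

# Toy data standing in for the real files; the '?' marker is hypothetical.
toy = pd.DataFrame({'horsepower': ['130', '165', '?', '150']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
hp_numeric = pd.to_numeric(toy['horsepower'], errors='coerce')

print(hp_numeric.isna().sum())  # 1
print(hp_numeric.dtype)
```

Counting the `NaN` values after the conversion shows how many entries were not parseable, which is worth checking before overwriting the original column.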


### Step 4. Oops, it seems our first dataset has some unnamed blank columns. Fix cars1.

In [12]:
cars1.dropna(axis=1, how='all', inplace=True)

In [13]:
cars1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           198 non-null    float64
 1   cylinders     198 non-null    int64  
 2   displacement  198 non-null    int64  
 3   horsepower    198 non-null    object 
 4   weight        198 non-null    int64  
 5   acceleration  198 non-null    float64
 6   model         198 non-null    int64  
 7   origin        198 non-null    int64  
 8   car           198 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 14.0+ KB
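
`dropna(axis=1, how='all')` removes every column whose values are all missing. As a sketch of an alternative, assuming the unwanted columns can be recognized by the `Unnamed` prefix seen in the `info()` output, they can also be filtered out by name. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mpg': [18.0, 15.0],
    'car': ['ford torino', 'amc rebel sst'],
    'Unnamed: 2': [np.nan, np.nan],
    'Unnamed: 3': [np.nan, np.nan],
})

# Option 1: drop columns in which every value is missing (returns a new frame).
by_content = toy.dropna(axis=1, how='all')

# Option 2: keep only columns whose name does not start with 'Unnamed'.
by_name = toy.loc[:, ~toy.columns.str.startswith('Unnamed')]

print(list(by_content.columns))  # ['mpg', 'car']
print(list(by_name.columns))     # ['mpg', 'car']
```

The content-based form is safer when a legitimately named column might be empty by accident and should be dropped too; the name-based form keeps such a column.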


### Step 5. What is the number of observations in each dataset?

In [14]:
print(f'The number of observations in cars1 is {cars1.shape[0]}')
print(f'The number of observations in cars2 is {cars2.shape[0]}')

The number of observations in cars1 is 198
The number of observations in cars2 is 200


### Step 6. Join cars1 and cars2 into a single DataFrame called cars

In [21]:
cars = pd.concat([cars1, cars2], axis=0).reset_index(drop=True)

In [22]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    int64  
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model         398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car           398 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 28.1+ KB
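
The `reset_index(drop=True)` call is what produces the continuous `RangeIndex` of 0 to 397 shown above. A small sketch with made-up frames, showing what the index looks like without it and that `ignore_index=True` appears to give the same result in one step:

```python
import pandas as pd

a = pd.DataFrame({'mpg': [18.0, 15.0]})
b = pd.DataFrame({'mpg': [27.0, 26.0, 25.0]})

# Without re-indexing, the row labels of both inputs are kept as-is.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())  # [0, 1, 0, 1, 2]

# ignore_index=True builds a fresh 0..n-1 index during concatenation.
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())    # [0, 1, 2, 3, 4]

# Same result as the reset_index(drop=True) form.
print(fresh.equals(stacked.reset_index(drop=True)))  # True
```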


### Step 7. Oops, there is a column missing, called owners. Create a Series of random numbers between 15,000 and 73,000.

In [23]:
# Draw one random integer per row of cars; note that `high` is exclusive.
random_num = np.random.randint(low=15000, high=73000, size=cars.shape[0])
random_serie = pd.Series(random_num)
random_serie

0      60492
1      48566
2      27629
3      56378
4      23083
       ...  
393    15993
394    37124
395    41794
396    44144
397    71309
Length: 398, dtype: int64
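
Because `np.random.randint` excludes its `high` bound, the call above can return at most 72,999, and its output changes on every run. If 73,000 should be a possible value and the draws should be repeatable, one option (a sketch, not the exercise's prescribed solution) is NumPy's `Generator` API with a fixed seed. The seed `42` is arbitrary, and the size `398` is taken from the row count of `cars` shown earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for repeatable draws

# endpoint=True makes the upper bound inclusive: values fall in [15000, 73000].
owners = pd.Series(
    rng.integers(15_000, 73_000, size=398, endpoint=True),
    name='owners',
)

print(owners.between(15_000, 73_000).all())  # True
print(len(owners))                           # 398
```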

### Step 8. Add the column owners to cars

In [24]:
cars['owners'] = random_serie
cars.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model | origin | car | owners |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu | 60492 |
| 1 | 15.0 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 | 48566 |
| 2 | 18.0 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite | 27629 |
| 3 | 16.0 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst | 56378 |
| 4 | 17.0 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino | 23083 |
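
Assigning a Series as a column aligns on index labels rather than on position, which is why the fresh index created in Step 6 matters here. A small sketch with made-up data, showing what would happen if the duplicate labels from a plain `pd.concat` had been kept:

```python
import pandas as pd

a = pd.DataFrame({'car': ['ford torino', 'amc rebel sst']})
b = pd.DataFrame({'car': ['datsun 210', 'vw pickup']})
values = pd.Series([10, 20, 30, 40])  # labels 0, 1, 2, 3

# Duplicate labels (0, 1, 0, 1): alignment reuses the values for labels 0 and 1.
dup = pd.concat([a, b])
dup['owners'] = values
print(dup['owners'].tolist())    # [10, 20, 10, 20]

# Fresh labels (0, 1, 2, 3): each row receives its own value.
clean = pd.concat([a, b], ignore_index=True)
clean['owners'] = values
print(clean['owners'].tolist())  # [10, 20, 30, 40]
```

In the first case the values 30 and 40 are silently dropped, so checking the index before assigning a Series is a cheap safeguard.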
