# Welcome to the Pandas Tutorial

In [None]:
# In Jupyter, we use "!" to run CLI commands
# The command below shows where our Python is running from
!where python

In [None]:
%matplotlib inline

## Imports

In [None]:
import pandas as pd


## Load Data

In [None]:
# Read the CSV file into a DataFrame (essentially a database table or an Excel spreadsheet)
df = pd.read_csv('../../Datasets/world-happiness/2019.csv')

In [None]:
df

## Inspect Data

### Look at the first 5 rows

In [None]:
df.head()

### Look at the shape of our data

In [None]:
num_rows, num_cols = df.shape
print(f"Number of rows: {num_rows}\nNumber of Cols: {num_cols}")

### Checking the data types of our columns

In [None]:
df.dtypes

### Describe the data

In [None]:
df.describe()

### Info of the data

In [None]:
df.info()

# The Difference Between Pandas Series and DataFrame

Pandas provides two main data structures:

- **Series**: a one-dimensional labeled array
- **DataFrame**: a two-dimensional labeled table

Understanding the difference between them is fundamental for data analysis, data science, and machine learning.


## 1. Pandas Series

A **Series** is a one-dimensional data structure.
It represents a single column of data with an associated index.

Key characteristics:
- One-dimensional
- Has values and an index
- Similar to a column in Excel


In [None]:
# Type of a Column
type(df['Country or region'])

## 2. Pandas DataFrame

A **DataFrame** is a two-dimensional data structure made of multiple Pandas Series.

Each column in a DataFrame is a Series, and all Series share the same index.

Key characteristics:
- Two-dimensional (rows and columns)
- Different columns can have different data types
- Similar to an Excel spreadsheet or SQL table


In [None]:
# Type of a DataFrame
type(df)

## 3. Relationship Between DataFrame and Series

A very important concept:

- A DataFrame column is a Pandas Series
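
A minimal sketch of this relationship, using a small made-up table (the `toy` frame and its values are illustrative assumptions, not taken from the happiness dataset):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Score": [7.5, 6.1, 4.9],
})

score = toy["Score"]  # selecting one column returns a Series

print(type(score).__name__)           # Series
print(score.index.equals(toy.index))  # True: the column shares the DataFrame's index
print(type(toy[["Score"]]).__name__)  # DataFrame: a list of columns keeps two dimensions
```

Single brackets with one column name give a Series, while double brackets (a list of names) give a DataFrame, even when the list has only one column.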


# Section #1 - Working with one dataset

## Question #1 - What is the average, min, and max happiness score?

In [None]:
# In this case, we can use the describe method in the Score Column
df['Score'].describe()

In [None]:
# Otherwise, we can use different methods, like 'mean', 'min', 'max'.
mean_score = df['Score'].mean()
min_score = df['Score'].min()
max_score = df['Score'].max()

print(
    f"The answer is:\nAverage:{mean_score:.4}\nMin:{min_score}\nMax:{max_score}")

## Question #2 - What is the score for Brazil?

### Option #1 - Use Bracket notation filtering

In [None]:
df.head()

In [None]:
# Create a boolean mask
brazil_boolean_mask = df['Country or region'] == 'Brazil'
brazil_boolean_mask

In [None]:
df[brazil_boolean_mask]

In [None]:
# It's only one row, but it is still a DataFrame
type(df[brazil_boolean_mask])

In [None]:
# Because it is a DataFrame, we can select a column from it; here we get the Score of Brazil
brazil_score = df[brazil_boolean_mask]['Score']

print(f"The Score of Brazil is: {brazil_score}")

In [None]:
# That returns a pandas Series; to get at the underlying data we use .values, which returns a NumPy array
brazil_score = df[brazil_boolean_mask]['Score'].values

print(f"The Score of Brazil is: {brazil_score}")

In [None]:
# To get only the value itself, we take the first element with [0]
brazil_score = df[brazil_boolean_mask]['Score'].values[0]

print(f"The Score of Brazil is: {brazil_score}")

In [None]:
# In only one line of code
df[df['Country or region'] == 'Brazil']['Score'].values[0]
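
As an alternative to chaining `df[mask]['Score'].values[0]`, the row filter and the column selection can go into a single `.loc` call. The sketch below assumes a made-up `toy` frame with illustrative scores (not the real dataset values):

```python
import pandas as pd

# Hypothetical toy data standing in for the happiness table
toy = pd.DataFrame({
    "Country or region": ["Finland", "Brazil", "Japan"],
    "Score": [7.0, 6.0, 5.0],  # illustrative values only
})

mask = toy["Country or region"] == "Brazil"

# Row filter and column selection in one .loc call
brazil_scores = toy.loc[mask, "Score"]

# .item() returns the scalar and raises ValueError unless exactly one row matched
brazil_score = brazil_scores.item()
print(brazil_score)  # 6.0
```

Using `.item()` instead of `.values[0]` makes the "exactly one match" assumption explicit: a misspelled country name fails loudly rather than raising an unrelated `IndexError`.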

### Option #2 - Use Indices with _.iloc_ and _.loc_

**_.iloc_ is a way to access data in a DataFrame by the numeric position of the rows and columns**

In [None]:
# Row at position 0, column at position 1 (the second column)
df.iloc[0, 1]

**_.loc_ is a way to access data in a DataFrame by the label of the index and the name of the column**

In [None]:
# First, we set the country name as the index
country_name_index = df.set_index(['Country or region'])
country_name_index

In [None]:
# Then, we use the .loc indexer to return the row with that label
country_name_index.loc['Brazil']

In [None]:
# We can narrow the selection further by also passing the column name
country_name_index.loc['Brazil', 'Score']

## Question #3 - Which country has the highest Generosity score?

### Option #1 - Use sort values

In [None]:
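highest_generosity_country = df.sort_values(
    by='Generosity', ascending=False).iloc[0, 1]
# Look up the value by the country we just found instead of hard-coding its name
highest_generosity = country_name_index.loc[highest_generosity_country, 'Generosity']

print(
    f"The country with the highest Generosity is: {highest_generosity_country}.\n"
    f"The Generosity there is: {highest_generosity}")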

### Option #2 - Use argmax

In [None]:
max_generosity_idx = df['Generosity'].argmax()
df.iloc[max_generosity_idx, 1]
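
`argmax` returns a position, which is why it is paired with `.iloc` above. When the country is the index, `idxmax` returns the label directly. A small sketch on made-up data (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C"],
    "Generosity": [0.15, 0.57, 0.32],
})

# With the country as index, idxmax returns the label of the largest value
by_country = toy.set_index("Country or region")
most_generous = by_country["Generosity"].idxmax()

print(most_generous)                                # B
print(by_country.loc[most_generous, "Generosity"])  # 0.57
```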

## Question #4 - What does the distribution of happiness scores look like?

In [None]:
score_column = df['Score']
score_column.hist()

## Question #5 - What does the distribution of all our columns look like?


In [None]:
# We assign the result to a throwaway variable; by convention this is the underscore '_'
_ = df.hist(figsize=(10, 10))

## Question #6 - What is the relationship between social support and happiness score?

In [None]:
df.plot(x='Social support', y='Score',
        title='Relationship between social support and happiness', kind='scatter', figsize=(10, 8))

## Question #7 - What if we wanted to have the Score go from 0 to 100 rather than 0 to 10?

In [None]:
df_copy = df.copy()

In [None]:
df_copy['Score out of 100'] = df['Score'] * 10

df_copy

# Section #2 - Concatenating multiple datasets

## Question #8 - I have multiple datasets that I want to combine, how can I bring them all together?

### Step #1 - Load all of the data

In [None]:
import glob

In [None]:
file_names = glob.glob('../../Datasets/world-happiness/*.csv')

In [None]:
dfs = {}

In [None]:
for f in file_names:
    data = pd.read_csv(f)
    year = f[-8:-4]
    dfs[year] = data

In [None]:
dfs.keys()

In [None]:
dfs['2019']

### Step #2 - Look at the shape of each year's DataFrame

In [None]:
for year, data in dfs.items():
    print(f'{year}: {data.shape}')

### Step #3 - Add a 'Year' column to each DataFrame

In [None]:
for year, data in dfs.items():
    print(f"{year}: {data.head(1)}")

In [None]:
for year, data in dfs.items():
    dfs[year]['Year'] = year

In [None]:
dfs['2015'].head()

In [None]:
dfs['2018'].head()

In [None]:
dfs['2019'].head()

### Step #4 - Combine data for years 2019 and 2018

In [None]:
type(dfs)

In [None]:
# The ^ operator on two sets gives their symmetric difference (items found in only one of them).
# We will use it to compare the columns of two DataFrames; as an example:
set([2, 3, 4]) ^ set([4, 5, 6])

In [None]:
dfs['2018'].columns

In [None]:
dfs['2019'].columns

In [None]:
# Now let's see if our columns are different or not

In [None]:
set(dfs['2018'].columns) ^ set(dfs['2019'].columns)
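
The symmetric difference says that something differs, but not on which side. A short sketch with plain set subtraction shows the direction as well (the column names here are hypothetical, chosen only to illustrate the operators):

```python
# Hypothetical column names, chosen only to illustrate the set operators
cols_a = {"Country", "Score", "Region"}
cols_b = {"Country", "Score", "Year"}

only_in_a = cols_a - cols_b        # in a but not in b
only_in_b = cols_b - cols_a        # in b but not in a
either_not_both = cols_a ^ cols_b  # symmetric difference

print(only_in_a)                # {'Region'}
print(only_in_b)                # {'Year'}
print(sorted(either_not_both))  # ['Region', 'Year']
```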

In [None]:
# Concatenate the two DataFrames
df_concat = pd.concat([dfs['2018'], dfs['2019']])

In [None]:
# concat keeps each DataFrame's original index, so label 0 now appears once per year.
# To get a fresh index, pass ignore_index=True (the default is False)
df_concat.loc[0]

In [None]:
# Concatenate the two DataFrames again, this time building a new index
df_concat = pd.concat([dfs['2018'], dfs['2019']], ignore_index=True)

In [None]:
df_concat

In [None]:
df_concat.loc[0]
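
The effect of `ignore_index` is easy to see on two tiny frames. This is a sketch with made-up values (`first` and `second` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frames, only for illustration
first = pd.DataFrame({"Score": [7.0, 6.0]})
second = pd.DataFrame({"Score": [5.0, 4.0]})

kept = pd.concat([first, second])                      # original labels are kept
reset = pd.concat([first, second], ignore_index=True)  # a new 0..n-1 index is built

print(kept.index.tolist())   # [0, 1, 0, 1]
print(reset.index.tolist())  # [0, 1, 2, 3]
print(len(kept.loc[0]))      # 2: label 0 matches one row from each input
```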

### Step #5 - Combine data for years 2019, 2018, and 2015

In [None]:
set(dfs['2019'].columns) ^ set(dfs['2015'].columns)

In [None]:
dfs['2019'].head(1)

In [None]:
dfs['2015'].head(1)

### Change the Name of The Columns

In [None]:
df_copy.columns

In [None]:
old_columns = list(df_copy.columns)

In [None]:
new_columns = old_columns.copy()

In [None]:
new_columns[0] = 'RANK'

In [None]:
old_columns

In [None]:
new_columns

In [None]:
df_copy.columns = new_columns

In [None]:
df_copy.columns

In [None]:
dfs['2019'].head(1)

In [None]:
dfs['2015'].head(1)

In [None]:
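# Rename the 2015 columns to the 2019 names, keeping the 2015 column order
# ('Perceptions of corruption' has to match the 2018/2019 spelling to line up in concat)
new_2015_column_names = [
    "Country or region", "Region", "Overall rank", "Score", "Standard Error", "GDP per capita",
    "Social support", "Healthy life expectancy", "Freedom to make life choices", "Perceptions of corruption", "Generosity", "Dystopia Residual", "Year"
]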

In [None]:
dfs['2015'].columns = new_2015_column_names

In [None]:
df_concat = pd.concat(
    [dfs['2015'], dfs['2019'], dfs['2018']], ignore_index=True)

## As an exercise: Concat the 2016 and 2017 data

## Question #9 - It looks like we have Region data for 2015, but not for 2018 or 2019. Is there a way to bring in Region data for the years we don't have it?

### Option #1 - Mapping (we're not going to use this)

In [None]:
df_copy = df_concat.copy()

In [None]:
region_mapping_dict = {
    "Rwanda": "Sub-Saharan Africa",
    "Syria": "Middle East and Northern Africa"
}

In [None]:
df_copy['Country or region'].map(region_mapping_dict)
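
A short sketch of why a hand-written mapping does not scale: every country missing from the dictionary comes back as `NaN`. The `countries` Series below is a made-up example; the dictionary repeats the two entries used above.

```python
import pandas as pd

# Hypothetical list of countries; "Brazil" is deliberately missing from the dictionary
countries = pd.Series(["Rwanda", "Syria", "Brazil"], name="Country or region")

region_mapping_dict = {
    "Rwanda": "Sub-Saharan Africa",
    "Syria": "Middle East and Northern Africa",
}

regions = countries.map(region_mapping_dict)
print(regions.tolist())
print(regions.isna().sum())  # 1: the unmapped country becomes NaN
```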

### Option #2 - Merge

In [None]:
dfs['2015'][['Country or region', 'Region']]

In [None]:
region_mapping_df = dfs['2015'][['Country or region', 'Region']]

In [None]:
df_concat

In [None]:
# how="left" keeps every row of df_concat, including countries missing from the 2015 region list
pd.merge(left=df_concat, right=region_mapping_df,
         how="left", on="Country or region")

In [None]:
# The merge above creates two columns, Region_x and Region_y, because both frames have a 'Region' column.
# To avoid this, we drop 'Region' from df_concat before merging
df_concat = df_concat.drop(columns='Region')

In [None]:
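# A left merge keeps countries that have no 2015 region; their Region is NaN, which Question #9b inspects
df_concat_with_region = pd.merge(left=df_concat, right=region_mapping_df,
                                 how="left", on="Country or region")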

In [None]:
df_concat_with_region

## Question #9b - Are there any regions that didn't match up?

In [None]:
df_concat_with_region['Region'].isna()

In [None]:
no_region_mask = df_concat_with_region['Region'].isna()

In [None]:
df_concat_with_region[no_region_mask]

In [None]:
no_region_mask.sum()
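
Another way to see which rows found no partner in a merge is the `indicator=True` argument of `pd.merge`, which adds a `_merge` column. A minimal sketch on made-up frames (`scores`, `regions`, and their values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frames; country "C" has no entry in the region table
scores = pd.DataFrame({
    "Country or region": ["A", "B", "C"],
    "Score": [7.0, 6.0, 5.0],
})
regions = pd.DataFrame({
    "Country or region": ["A", "B"],
    "Region": ["North", "South"],
})

# A left merge keeps every row of `scores`; indicator=True records where each row came from
merged = pd.merge(left=scores, right=regions, how="left",
                  on="Country or region", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["Country or region"].tolist())  # ['C']
```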

## Question #10 - Now that we have a nice combined dataset, how do we write it back out to a file for later usage?

In [None]:
df_concat_with_region.to_csv('combined-happiness-score.csv', index=False)
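
To check what `index=False` does, the sketch below writes a made-up frame to an in-memory buffer (standing in for a file on disk) and reads it back. The `toy` frame is an assumption for illustration.

```python
import io

import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.0, 6.0]})

# An in-memory buffer stands in for a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False: the row labels are not written as a column
print(buffer.getvalue().splitlines()[0])  # Country or region,Score

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())  # ['Country or region', 'Score']
```

Without `index=False`, the file would gain an extra unnamed first column holding the old index.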

In [None]:
ls

# Section #3 - Working with our new data from multiple years

In [None]:
df_concat_with_region.head()

## Question #11 - How many countries do we have per Region?

In [None]:
# This counts each country once per year, because the data covers multiple years
df_concat_with_region['Region'].value_counts()

In [None]:
only_one_year_mask = df_concat_with_region['Year'] == '2019'

In [None]:
# To really get all the countries per Region, we have to filter one year only
df_concat_with_region[only_one_year_mask]['Region'].value_counts()

## Question #12 - What is the average happiness score by region?

In [None]:
avg_happiness_region = df_concat_with_region[[
    'Region', 'Score']].groupby('Region').mean()

In [None]:
avg_happiness_region.sort_values(by='Score', ascending=False)
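
If more than the mean is of interest, `agg` can compute several statistics per group in one pass. A sketch on made-up data (the `toy` frame and its region names are assumptions):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Score": [7.0, 5.0, 4.0, 2.0],
})

summary = (
    toy.groupby("Region")["Score"]
    .agg(["mean", "min", "max"])              # one column per statistic
    .sort_values(by="mean", ascending=False)  # highest average first
)
print(summary)
```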

## Question #13 - What is the average happiness score over time?

In [None]:
df_concat_with_region['Year'].dtype

In [None]:
df_concat_with_region['Year'] = df_concat_with_region['Year'].astype(int)

In [None]:
df_concat_with_region.groupby('Year').mean(numeric_only=True)

In [None]:
avg_happiness_over_time = df_concat_with_region.groupby(
    'Year').mean(numeric_only=True)['Score']

In [None]:
avg_happiness_over_time.sort_values(ascending=False)

## Question #14 - In 2019, how many countries in Africa (sub-Saharan and Northern/Middle East) have a happiness score higher than the lowest score in Western Europe?

**First, let's get only the year 2019**

In [None]:
year_2019_mask = df_concat_with_region['Year'] == 2019

In [None]:
df_only_2019 = df_concat_with_region[year_2019_mask]

In [None]:
df_only_2019.head()

**Then, let's find the lowest score in Western Europe**

In [None]:
western_europe_2019_mask = df_only_2019['Region'] == 'Western Europe'

In [None]:
score_western_europe_2019 = df_only_2019[western_europe_2019_mask]['Score'].values

In [None]:
lowest_score_western_europe = score_western_europe_2019.min()

print(f'Lowest Score in Western Europe: {lowest_score_western_europe}')

**With this, we can filter the countries in Africa that have a higher score than the lowest score in Western Europe**

In [None]:
df_only_2019['Region'].unique()

In [None]:
africa_2019_mask = (
    (df_only_2019['Region'] == 'Middle East and Northern Africa') |
    (df_only_2019['Region'] == 'Sub-Saharan Africa')
)

In [None]:
african_countries_2019 = df_only_2019[africa_2019_mask]

In [None]:
african_countries_2019.head()

In [None]:
african_countries_score_higher_than_western_europe = african_countries_2019[
    'Score'] > lowest_score_western_europe

In [None]:
len(african_countries_2019[african_countries_score_higher_than_western_europe]
    ['Country or region'])

## Question #15 - Can we look at the change in average happiness levels by region over time?

In [None]:
df_concat_with_region.groupby(['Year', 'Region'])['Score'].mean()

In [None]:
# unstack(level=0) moves Year into the columns and leaves Region as the index,
# so the bar chart shows one group per region with one bar per year
df_concat_with_region.groupby(['Year', 'Region'])[
    'Score'].mean().unstack(level=0).plot(kind='bar')
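
What `unstack(level=0)` does to the grouped result is easier to see without the plot. A sketch on made-up data (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019],
    "Region": ["North", "South", "North", "South"],
    "Score": [6.0, 4.0, 7.0, 5.0],
})

# Grouping by two keys gives a Series with a two-level index (Year, Region)
by_year_region = toy.groupby(["Year", "Region"])["Score"].mean()

# Level 0 (Year) moves into the columns; Region stays as the index
wide = by_year_region.unstack(level=0)
print(wide.index.tolist())    # ['North', 'South']
print(wide.columns.tolist())  # [2018, 2019]
```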

## Question #16 - Which countries had their score increase the most over time?

In [None]:
df_concat_with_region.pivot_table(
    values='Score', index='Country or region', columns='Year')

In [None]:
df_countries_over_time = df_concat_with_region.pivot_table(
    values='Score', index='Country or region', columns='Year')

In [None]:
df_countries_over_time['Difference'] = df_countries_over_time[2019] - \
    df_countries_over_time[2015]

In [None]:
df_countries_over_time.head()

In [None]:
df_countries_over_time.sort_values(by='Difference', ascending=False)
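
Countries that are missing in one of the two years end up with `NaN` in the difference. A sketch on made-up data showing one way to drop them before ranking (the `toy` frame and the choice to drop are assumptions):

```python
import pandas as pd

# Hypothetical toy data; country "C" has no 2015 score
toy = pd.DataFrame({
    "Country or region": ["A", "A", "B", "B", "C"],
    "Year": [2015, 2019, 2015, 2019, 2019],
    "Score": [5.0, 6.5, 7.0, 6.8, 4.0],
})

wide = toy.pivot_table(values="Score", index="Country or region", columns="Year")
wide["Difference"] = wide[2019] - wide[2015]

# Rows without a score in both years have NaN here; drop them before ranking
ranked = wide.dropna(subset=["Difference"]).sort_values(by="Difference", ascending=False)
print(ranked.index.tolist())  # ['A', 'B']
```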

# Homework

1. See about getting 2016 and 2017 data into our DataFrame.
2. Join in some additional data to this dataset using pd.merge. For example, population data.
3. Find your own dataset to play with!

# 🐼 Pandas Extra Tips
## Essential Data Cleaning Techniques

---

## Table of Contents
1. Setup and Dataset
2. Remove Duplicates
3. Strip Values
4. Splitting Columns
5. Replace
6. Fill NA
7. Combining Techniques

---
## Creating the Example Dataset

Let's create a dataset with common "dirty" data problems:
- Duplicates
- Leading and trailing whitespace
- Columns that need to be split
- Inconsistent values
- Null values

In [None]:
import numpy as np
# Create a "dirty" dataset to practice cleaning
data = {
    'customer_id': [1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10],
    'name': ['  John Doe  ', 'Jane Smith', 'Jane Smith', '  Bob Johnson',
             'Alice Brown  ', 'Charlie Wilson', 'Charlie Wilson', 'Diana Prince',
             'Eve Adams', 'Frank Miller', '  Grace Lee', 'Henry Ford  '],
    'email_phone': ['john@email.com|555-0100', 'jane@email.com|555-0101',
                    'jane@email.com|555-0101', 'bob@email.com|555-0102',
                    'alice@email.com|555-0103', 'charlie@email.com|555-0104',
                    'charlie@email.com|555-0104', 'diana@email.com|555-0105',
                    'eve@email.com|555-0106', 'frank@email.com|555-0107',
                    'grace@email.com|555-0108', 'henry@email.com|555-0109'],
    'status': ['Active', 'active', 'active', 'INACTIVE', 'Active',
               'Inactive', 'Inactive', 'active', 'ACTIVE', 'inactive',
               'Active', 'Active'],
    'purchase_amount': [100.50, 250.00, 250.00, np.nan, 450.75,
                        125.30, 125.30, 890.00, np.nan, 340.20,
                        np.nan, 560.80],
    'city_state': ['New York, NY', 'Los Angeles, CA', 'Los Angeles, CA',
                   'Chicago, IL', 'Houston, TX', 'Phoenix, AZ', 'Phoenix, AZ',
                   'Philadelphia, PA', 'San Antonio, TX', 'San Diego, CA',
                   'Dallas, TX', 'San Jose, CA']
}

df = pd.DataFrame(data)
print(f"Dataset created with {len(df)} rows!")
df

**Problems in this dataset:**
- ❌ Duplicate rows (customer_id 2 and 5)
- ❌ Leading and trailing whitespace in names
- ❌ Email and phone in the same column
- ❌ Status with inconsistent capitalization
- ❌ Null values in purchase_amount
- ❌ City and state together

### Check for duplicates

In [None]:
# See which rows are duplicated
print("\nDuplicate rows:")
df[df.duplicated(keep=False)]

### Remove duplicates - Method 1: All columns

In [None]:
# Remove duplicates considering ALL columns
df_no_dup = df.drop_duplicates()

print(f"Rows before: {len(df)}")
print(f"Rows after: {len(df_no_dup)}")
print(f"Rows removed: {len(df) - len(df_no_dup)}")

df_no_dup

### Remove duplicates - Method 2: Based on specific columns

In [None]:
# Remove duplicates based only on customer_id
df_no_dup_id = df.drop_duplicates(subset=['customer_id'])

print(f"Rows before: {len(df)}")
print(f"Rows after: {len(df_no_dup_id)}")
print(f"Rows removed: {len(df) - len(df_no_dup_id)}")

df_no_dup_id

### Choose which duplicate to keep

In [None]:
# Keep the first occurrence (default)
df_keep_first = df.drop_duplicates(subset=['customer_id'], keep='first')
print("Keeping the FIRST occurrence:")
print(df_keep_first[df_keep_first['customer_id'].isin([2, 5])])

print("\n" + "="*50 + "\n")

# Keep the last occurrence
df_keep_last = df.drop_duplicates(subset=['customer_id'], keep='last')
print("Keeping the LAST occurrence:")
print(df_keep_last[df_keep_last['customer_id'].isin([2, 5])])

### Tip: Remove duplicates in place

In [None]:
# Create a copy for demonstration
df_copy = df.copy()

# Remove duplicates by modifying the DataFrame itself
df_copy.drop_duplicates(subset=['customer_id'], inplace=True)

print(f"Rows after drop_duplicates with inplace=True: {len(df_copy)}")
df_copy.head()

---
## Strip Values

### Goal:
Remove whitespace at the beginning and end of strings

In [None]:
# We'll work with df_no_dup_id (no duplicates)
df_clean = df_no_dup_id.copy()

print("BEFORE strip:")
print(df_clean['name'].head())
print("\nNote the spaces:")
for name in df_clean['name'].head():
    print(f"'{name}'")

### Strip with .str.strip()

In [None]:
# Apply strip to the name column
df_clean['name'] = df_clean['name'].str.strip()

print("\nAFTER strip:")
for name in df_clean['name'].head():
    print(f"'{name}'")

df_clean.head()

---
## Splitting Columns

### Goal:
Split one column into multiple columns

In [None]:
print("email_phone column BEFORE the split:")
print(df_clean['email_phone'].head())

### Method 1: Split with .str.split()

In [None]:
# Split email_phone into two columns
df_clean[['email', 'phone']] = df_clean['email_phone'].str.split(
    '|', expand=True)

print("\nAFTER the split:")
print(df_clean[['email_phone', 'email', 'phone']].head())

### Method 2: Split city_state

In [None]:
# Split city_state into city and state
df_clean[['city', 'state']] = df_clean['city_state'].str.split(
    ', ', expand=True)

print("City and State separated:")
print(df_clean[['city_state', 'city', 'state']].head())

### Remove the original column after the split

In [None]:
# Remove the columns that have already been split
df_clean = df_clean.drop(columns=['email_phone', 'city_state'])

print("✅ Columns removed!")
print(f"\nCurrent columns: {df_clean.columns.tolist()}")
df_clean.head()

---
## Replace

### Goal:
Replace specific values or patterns

In [None]:
print("'status' column BEFORE replace:")
print(df_clean['status'].value_counts())

In [None]:
# Replace individual values
df_clean['status'] = df_clean['status'].replace({
    'active': 'Active',
    'ACTIVE': 'Active',
    'inactive': 'Inactive',
    'INACTIVE': 'Inactive'
})

print("\nAFTER replace:")
print(df_clean['status'].value_counts())
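
When the variants differ only in letter case, as they do here, a string method can stand in for the explicit dictionary. A sketch using `.str.capitalize()` on a made-up Series (an alternative shown for comparison, not the approach used above):

```python
import pandas as pd

# Hypothetical status values with inconsistent capitalization
status = pd.Series(["Active", "active", "INACTIVE", "Inactive", "ACTIVE"])

# str.capitalize upper-cases the first character and lower-cases the rest
normalized = status.str.capitalize()
print(normalized.unique().tolist())  # ['Active', 'Inactive']
```

The dictionary form remains the better fit when the variants are not simple case differences (for example `'actv'` or `'A'`).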

---
## Fill NA

### Goal:
Handle null values (NaN) in the dataset

In [None]:
# Check for null values
print("NULL VALUES PER COLUMN:")
print(df_clean.isnull().sum())

print("\n" + "="*50)
print(f"\nTotal null values: {df_clean.isnull().sum().sum()}")
print(
    f"Percentage of missing data: {(df_clean.isnull().sum().sum() / df_clean.size * 100):.2f}%")

### View rows with null values

In [None]:
print("\nRows with null values in purchase_amount:")
df_clean[df_clean['purchase_amount'].isnull()]

### Method 1: Fill with a specific value

In [None]:
# Fill with 0
df_fill_zero = df_clean.copy()
df_fill_zero['purchase_amount'] = df_fill_zero['purchase_amount'].fillna(0)

print("Filled with 0:")
print(df_fill_zero[['name', 'purchase_amount']].head(10))

### Method 2: Fill with the mean

In [None]:
# Fill with the mean
df_fill_mean = df_clean.copy()
mean_value = df_fill_mean['purchase_amount'].mean()
df_fill_mean['purchase_amount'] = df_fill_mean['purchase_amount'].fillna(
    mean_value)

print(f"Calculated mean: ${mean_value:.2f}")
print("\nFilled with the mean:")
print(df_fill_mean[['name', 'purchase_amount']].head(10))

### Tip: Remove rows with null values

In [None]:
# If you prefer to REMOVE rows with NA instead of filling them
df_drop_na = df_clean.copy()

print(f"Rows before: {len(df_drop_na)}")

# Remove rows with any null value
df_drop_na = df_drop_na.dropna()

print(f"Rows after: {len(df_drop_na)}")
print(f"Rows removed: {len(df_clean) - len(df_drop_na)}")
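
Calling `dropna()` with no arguments drops a row if any column is null. When only some columns matter, the `subset` argument limits the check. A sketch on made-up data (the `toy` frame is an illustrative assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one null phone, one null purchase_amount
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],
    "purchase_amount": [10.0, np.nan, 30.0],
    "phone": [None, "555-0101", "555-0102"],
})

dropped_any = toy.dropna()                               # a null in any column drops the row
dropped_amount = toy.dropna(subset=["purchase_amount"])  # only purchase_amount is checked

print(len(dropped_any))     # 1
print(len(dropped_amount))  # 2
```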

---
## Combining All the Techniques

### Complete data cleaning pipeline

In [279]:
# Start from the original dataset
df_pipeline = df.copy()

print("="*60)
print("STARTING DATA CLEANING PIPELINE")
print("="*60)

# 1. Remove duplicates
print("\n1️⃣ Removing duplicates...")
initial_rows = len(df_pipeline)
df_pipeline = df_pipeline.drop_duplicates(subset=['customer_id'], keep='first')
print(f"   Rows removed: {initial_rows - len(df_pipeline)}")

# 2. Strip whitespace
print("\n2️⃣ Removing whitespace...")
string_cols = df_pipeline.select_dtypes(include=['object']).columns
for col in string_cols:
    df_pipeline[col] = df_pipeline[col].str.strip()
print(f"   Columns processed: {len(string_cols)}")

# 3. Split columns
print("\n3️⃣ Splitting columns...")
df_pipeline[['email', 'phone']] = df_pipeline['email_phone'].str.split(
    '|', expand=True)
df_pipeline[['city', 'state']] = df_pipeline['city_state'].str.split(
    ', ', expand=True)
df_pipeline = df_pipeline.drop(columns=['email_phone', 'city_state'])
print("   New columns created: email, phone, city, state")

# 4. Replace to standardize values
print("\n4️⃣ Standardizing values...")
df_pipeline['status'] = df_pipeline['status'].replace({
    'active': 'Active',
    'ACTIVE': 'Active',
    'inactive': 'Inactive',
    'INACTIVE': 'Inactive'
})
print("   Column 'status' standardized")

# 5. Fill NA (this time with the median)
print("\n5️⃣ Filling null values...")
na_count = df_pipeline['purchase_amount'].isnull().sum()
df_pipeline['purchase_amount'] = df_pipeline['purchase_amount'].fillna(
    df_pipeline['purchase_amount'].median()
)
print(f"   Null values filled: {na_count}")

print("\n" + "="*60)
print("✅ PIPELINE COMPLETE!")
print("="*60)
print(
    f"\nFinal dataset: {len(df_pipeline)} rows x {len(df_pipeline.columns)} columns")

STARTING DATA CLEANING PIPELINE

1️⃣ Removing duplicates...
   Rows removed: 2

2️⃣ Removing whitespace...
   Columns processed: 4

3️⃣ Splitting columns...
   New columns created: email, phone, city, state

4️⃣ Standardizing values...
   Column 'status' standardized

5️⃣ Filling null values...
   Null values filled: 3

✅ PIPELINE COMPLETE!

Final dataset: 10 rows x 8 columns


In [280]:
# View the clean dataset
print("\nCLEAN DATASET - First rows:")
df_pipeline.head()


CLEAN DATASET - First rows:


|   | customer_id | name | status | purchase_amount | email | phone | city | state |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | John Doe | Active | 100.5 | john@email.com | 555-0100 | New York | NY |
| 1 | 2 | Jane Smith | Active | 250.0 | jane@email.com | 555-0101 | Los Angeles | CA |
| 3 | 3 | Bob Johnson | Inactive | 340.2 | bob@email.com | 555-0102 | Chicago | IL |
| 4 | 4 | Alice Brown | Active | 450.75 | alice@email.com | 555-0103 | Houston | TX |
| 5 | 5 | Charlie Wilson | Inactive | 125.3 | charlie@email.com | 555-0104 | Phoenix | AZ |
