### __Python Data Wrangling: Duplicate and Absent Values__

[_.value_counts(dropna=False)_](#valueCounts)

[_df_logs = pd.read_csv('path', keep_default_na=False)_](#keepDefaultNA)

[_df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function()_](#groupBy)

##### __Count missing values__

There are many ways to find and count missing values in pandas. In this lesson, you learned three ways:

- Calling info() on a DataFrame.

- Calling isna().sum() on a DataFrame or Series.

- Calling value_counts(dropna=False) on a Series.

In [2]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv')

df_logs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user_id   200000 non-null  int64 
 1   source    198326 non-null  object
 2   email     13953 non-null   object
 3   purchase  200000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 6.1+ MB


In [3]:
print(df_logs.isna().sum())

user_id          0
source        1674
email       186047
purchase         0
dtype: int64


Another way to find missing values is to call value_counts() on the 'source' column with the dropna=False parameter.

##### _.value_counts()_ <a id='valueCounts'></a>

Returns a Series containing the frequency of each distinct row in the DataFrame. When called on a Series, as in the example further down, it counts each distinct value instead.

_DataFrame.value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)_

_subset_ label or list of labels, optional, Columns to use when counting unique combinations.

_normalize_ bool, default False, Return proportions rather than frequencies.

_sort_ bool, default True, Sort by frequencies when True. Sort by DataFrame column values when False.

_ascending_ bool, default False, Sort in ascending order.

_dropna_ bool, default True, Don’t include counts of rows that contain NA values.
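
The effect of the dropna and normalize parameters can be sketched on a small Series. The values below are hypothetical and are not taken from visit_log.csv; they only illustrate how the counts change.

```python
import pandas as pd

# Hypothetical traffic sources; None is treated as a missing value.
sources = pd.Series(['other', 'email', None, 'other', 'context', None, 'other'])

# Default behaviour (dropna=True): missing values are left out of the result.
counts_default = sources.value_counts()

# dropna=False adds an entry for the missing values.
counts_with_nan = sources.value_counts(dropna=False)

# normalize=True returns proportions instead of raw counts.
shares = sources.value_counts(normalize=True, dropna=False)

print(counts_default)
print()
print(counts_with_nan)
print()
print(shares)
```

With dropna=False the counts add up to the full length of the Series, which makes it easy to see how much of a column is missing.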

In [6]:
print(df_logs['source'].value_counts(dropna=False))

source
other      133834
context     52032
email       12279
NaN          1674
undef         181
Name: count, dtype: int64


##### __Filter missing values__

To keep only the rows that do not have a missing 'source', build a Boolean mask with isna() and invert it with the __tilde operator (~)__. Selecting the rows that do have a missing 'source' uses the same mask without the tilde. Here's the breakdown of this code:

- We extract the 'source' column using df_logs['source'].

- Next, we apply the isna() method to it to obtain a series of Booleans indicating missing values: df_logs['source'].isna().

- We invert the series using ~. This turns every True value into False and vice versa.

- We use this series of Booleans to filter the original DataFrame, extracting only the rows in which 'source' has no missing values.

- Finally, we print the resulting table.
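
The steps above can be sketched on a tiny DataFrame. The data here is hypothetical; notna() is shown only as an equivalent shortcut for ~isna().

```python
import pandas as pd

# Hypothetical data: two rows have no 'source'.
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'source': ['other', None, 'email', None],
})

# Boolean mask: True where 'source' is missing.
mask_missing = df['source'].isna()

# ~ flips every value in the mask.
mask_present = ~mask_missing

# Keep only the rows where 'source' is present.
rows_present = df[mask_present]

# notna() builds the inverted mask directly.
rows_present_alt = df[df['source'].notna()]

print(rows_present)
```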

In [None]:
print(df_logs[~df_logs['source'].isna()]) # Display non-null source values

           user_id   source       email  purchase
0       7141786820    other         NaN         0
1       5644686960    email  c129aa540a         0
2       1914055396  context         NaN         0
3       4099355752    other         NaN         0
4       6032477554  context         NaN         1
...            ...      ...         ...       ...
199995  8714621942    other         NaN         0
199996  6064948744  context         NaN         1
199997  9210683879  context         NaN         0
199998  1629959686    other         NaN         1
199999  2089329795    other         NaN         0

[198326 rows x 4 columns]


In [None]:
print(df_logs[df_logs['source'].isna()]) # Display null source values

           user_id source       email  purchase
22      1397217221    NaN  79ac569f0b         0
49      5062457902    NaN  9ddce3a861         0
171     6724868284    NaN  c0e48c7cf8         0
258     3221384063    NaN  7fe8da1823         0
379     7515782311    NaN  462462af10         0
...            ...    ...         ...       ...
199342  3439213943    NaN  7edda4e2a4         0
199661  9473123762    NaN  3535509f51         0
199689   722485056    NaN  470ffa3800         0
199709  5950023506    NaN  0fb749d485         0
199758  3747926428    NaN  604850216f         0

[1674 rows x 4 columns]


In [9]:
print(df_logs[(~df_logs['email'].isna()) & (df_logs['source'] == 'email')]) # Display rows where email is not null and source is 'email'

           user_id source       email  purchase
1       5644686960  email  c129aa540a         0
11      8623045648  email  d6d19c571c         0
18      5739438900  email  19379ee49c         0
19      7486955288  email  09c27794fa         0
33      7298923004  email  1fe184ed73         0
...            ...    ...         ...       ...
199922  4075894991  email  2c9a202435         0
199958  9794381984  email  85712b433a         0
199970  3396355438  email  4bba3fde78         0
199979  5008169696  email  e5128e15fd         0
199989  9470921783  email  3977de6aaa         0

[12279 rows x 4 columns]


In [10]:
df_emails = df_logs[~df_logs['email'].isna()] # Filter rows where email is not null

print(df_emails.head(10))

       user_id source       email  purchase
1   5644686960  email  c129aa540a         0
11  8623045648  email  d6d19c571c         0
18  5739438900  email  19379ee49c         0
19  7486955288  email  09c27794fa         0
22  1397217221    NaN  79ac569f0b         0
33  7298923004  email  1fe184ed73         0
43  6034222291  email  fb58a27f03         0
49  5062457902    NaN  9ddce3a861         0
56  5690036640  email  a088a48182         0
66  9963049355  email  9cc43ebd15         0


In [None]:
df_emails = df_logs[(df_logs["email"].isna()) & (df_logs["source"].isna())] # Filter rows where email and source are both null

print(df_emails)

##### __Fill in missing categorical values__

##### _Quantitative vs. Categorical Variables_

Quantitative variables have numerical values that we can use for arithmetic calculations, for example, height, weight, age, and income. In Python, these values tend to be stored as integers or floats.

Categorical variables represent a set of possible values that a particular observation can have, for example, the color, make, and model of a car. In Python, these values tend to be stored as strings, but they can also be Boolean values or even integers.

Some examples of integer categorical values are postal codes or numerical labels that represent other values (e.g., 1 = red, 2 = blue, etc.). In either case, it doesn't make sense to perform arithmetic operations on categorical values.

The way we fill in missing values depends on whether they are quantitative or categorical.
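
As a minimal sketch of that difference, the example below fills a categorical column with a placeholder label and a quantitative column with a typical value. The data and the 'unknown' label are assumptions chosen for illustration, not a rule from the lesson.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    'device_type': ['desktop', None, 'mobile', 'desktop'],  # categorical
    'age': [30.0, 25.0, None, 41.0],                        # quantitative
})

# Categorical: arithmetic makes no sense, so use a placeholder label.
df['device_type'] = df['device_type'].fillna('unknown')

# Quantitative: use a typical value such as the median (or the mean).
age_median = df['age'].median()
df['age'] = df['age'].fillna(age_median)

print(df)
```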

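__keep_default_na=False__ tells read_csv() not to convert its default missing-value markers (empty fields, 'NA', 'NaN', and so on) to NaN, so empty fields are read as _empty strings_. For text columns, the result resembles calling .fillna('') after loading. <a id='keepDefaultNA'></a>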
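
The difference can be checked without a file on disk. The sketch below reads a small, hypothetical CSV string twice, once with the defaults and once with keep_default_na=False.

```python
import io
import pandas as pd

# Hypothetical CSV content: empty fields and the literal text "NA".
csv_text = "user_id,source,email\n1,other,\n2,,c129aa540a\n3,NA,\n"

# Default: empty fields and markers such as "NA" are parsed as NaN.
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: the same fields are kept as plain strings.
df_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(df_default.isna().sum())
print()
print(df_raw['source'].tolist())
print(df_raw['email'].tolist())
```

Note that with keep_default_na=False the text "NA" also stays as a string, so it is worth checking unique() for leftover markers after loading.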

In [None]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv', keep_default_na=False) # Keep empty fields as empty strings instead of NaN

print(df_logs.head())
print()
df_logs.info()

      user_id   source       email  purchase
0  7141786820    other                     0
1  5644686960    email  c129aa540a         0
2  1914055396  context                     0
3  4099355752    other                     0
4  6032477554  context                     1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   user_id   200000 non-null  int64 
 1   source    200000 non-null  object
 2   email     200000 non-null  object
 3   purchase  200000 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 6.1+ MB


##### __Exercise__

In [13]:
import pandas as pd
df_logs = pd.read_csv('DataSets/visit_log.csv')

df_logs['email'] = df_logs['email'].fillna(value='')
print(df_logs.head())

      user_id   source       email  purchase
0  7141786820    other                     0
1  5644686960    email  c129aa540a         0
2  1914055396  context                     0
3  4099355752    other                     0
4  6032477554  context                     1


In [15]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv', keep_default_na=False)

df_sources = df_logs['source'].unique()
print(df_sources)

['other' 'email' 'context' '' 'undef']


In [17]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv', keep_default_na=False)

df_logs['source'] = df_logs['source'].replace('', 'email')
print(df_logs['source'].unique())

['other' 'email' 'context' 'undef']


In [19]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv', keep_default_na=False)
df_logs['source'] = df_logs['source'].replace('', 'email')

visits = df_logs.groupby('source')['user_id'].count()
print(visits)

source
context     52032
email       13953
other      133834
undef         181
Name: user_id, dtype: int64


##### _.groupby()_ <a index='groupBy'></a>

The .groupby() method in pandas splits your DataFrame into groups based on one or more columns, then lets you apply an operation to each group, such as a sum, mean, count, or a custom function.

_df.groupby('column_to_group_by')['column_to_aggregate'].aggregation_function()_

__Common Aggregation Functions__

You can use:

.sum(): total

.mean(): average

.count(): number of non-missing values

.max() / .min(): highest/lowest

.median(), .std(), .nunique(), etc.
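
One detail worth checking on a small example is that .count() skips missing values, while .size() counts every row in the group. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical sales data with one missing value in store A.
df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'B'],
    'sales': [100, None, 50, 50, 300],
})

grouped = df.groupby('store')['sales']

summary = pd.DataFrame({
    'count': grouped.count(),      # non-missing values only
    'size': grouped.size(),        # all rows, missing or not
    'nunique': grouped.nunique(),  # distinct non-missing values
})

print(summary)
```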

In [4]:
import pandas as pd

data = {
    'store': ['A', 'A', 'B', 'B', 'B', 'C'],
    'month': ['january', 'january', 'octuber', 'june', 'february', 'december'],
    'sales': [100, 200, 50, 300, 150, 400]
}

df = pd.DataFrame(data)

print(df)
print()

print(df.groupby('store')['sales'].sum())
print()
print(df.groupby(['store', 'month'])['sales'].sum())
print()
print(df.groupby('store')['sales'].agg(['sum', 'mean', 'count']))
print()
df.groupby("month", as_index=False)["sales"].mean()


  store     month  sales
0     A   january    100
1     A   january    200
2     B   octuber     50
3     B      june    300
4     B  february    150
5     C  december    400

store
A    300
B    500
C    400
Name: sales, dtype: int64

store  month   
A      january     300
B      february    150
       june        300
       octuber      50
C      december    400
Name: sales, dtype: int64

       sum        mean  count
store                        
A      300  150.000000      2
B      500  166.666667      3
C      400  400.000000      1



      month  sales
0  december  400.0
1  february  150.0
2   january  150.0
3      june  300.0
4   octuber   50.0


In [21]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv', keep_default_na=False)
df_logs['source'] = df_logs['source'].replace('', 'email')

purchases = df_logs.groupby('source')['purchase'].sum()
print(purchases)

source
context    3029
email      1021
other      8041
undef        12
Name: purchase, dtype: int64


In [23]:
import pandas as pd

df_logs = pd.read_csv('DataSets/visit_log.csv', keep_default_na=False)
df_logs['source'] = df_logs['source'].replace('', 'email')

visits = df_logs.groupby('source')['user_id'].count()
purchases = df_logs.groupby('source')['purchase'].sum()

conversion = purchases / visits
print(conversion)

source
context    0.058214
email      0.073174
other      0.060082
undef      0.066298
dtype: float64
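
When no values are missing in either column, dividing the summed purchases by the visit count is the same as taking the mean of 'purchase' per group. The sketch below checks this on hypothetical data; it is an alternative formulation, not the method used in the lesson.

```python
import pandas as pd

# Hypothetical visit log: 'purchase' is 1 for a purchase, 0 otherwise.
df = pd.DataFrame({
    'source': ['email', 'email', 'other', 'other', 'other', 'context'],
    'user_id': [1, 2, 3, 4, 5, 6],
    'purchase': [1, 0, 0, 1, 0, 0],
})

# Two-step version: purchases divided by visits.
visits = df.groupby('source')['user_id'].count()
purchases = df.groupby('source')['purchase'].sum()
conversion = purchases / visits

# One-step version: the mean of 'purchase' in each group.
conversion_alt = df.groupby('source')['purchase'].mean()

print(conversion)
print()
print(conversion_alt)
```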


##### __Fill in missing quantitative values__

In [25]:
import pandas as pd

analytics_data = pd.read_csv('DataSets/web_analytics_data.csv')
print(analytics_data.head(10))

      user_id device_type   age    time
0  7141786820     desktop  33.0  2127.0
1  5644686960      mobile  30.0    35.0
2  1914055396     desktop  25.0     NaN
3  4099355752     desktop  25.0  2123.0
4  6032477554     desktop  27.0    59.0
5  5872473344      mobile  27.0     NaN
6  7977025176      mobile   NaN     NaN
7  3512872755     desktop  40.0    65.0
8  1827368713     desktop  37.0     NaN
9  8688870165     desktop  36.0  2124.0


Use mean() when the data has no significant outliers.

Use median() when the data has significant outliers.

The mean is not a good typical value when the data you are working with has significant outliers. For example, suppose five employees at a company have salaries of $30,000. Both the mean and median are equal to $30,000.

Then, a marketing director is hired with a salary of $90,000. The mean has risen to $40,000, while the median remains at $30,000.

This outlier makes the median a better indicator of typical salary than the mean.
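
The salary example can be reproduced in a few lines, using the figures from the text above.

```python
import pandas as pd

# Five employees, each earning $30,000.
salaries = pd.Series([30_000] * 5)
mean_before = salaries.mean()
median_before = salaries.median()

# A marketing director joins with a salary of $90,000.
salaries_after = pd.concat([salaries, pd.Series([90_000])], ignore_index=True)
mean_after = salaries_after.mean()
median_after = salaries_after.median()

print(f"Before: mean={mean_before}, median={median_before}")
print(f"After:  mean={mean_after}, median={median_after}")
```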

In [28]:
import pandas as pd

analytics_data = pd.read_csv('DataSets/web_analytics_data.csv')

age_avg = analytics_data['age'].mean()
analytics_data['age'] = analytics_data['age'].fillna(age_avg)

desktop_data = analytics_data[analytics_data['device_type'] == 'desktop']
mobile_data =  analytics_data[analytics_data['device_type'] == 'mobile']

desktop_data.info()
print()
mobile_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73764 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      73764 non-null  int64  
 1   device_type  73764 non-null  object 
 2   age          73764 non-null  float64
 3   time         61588 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 2.8+ MB

<class 'pandas.core.frame.DataFrame'>
Index: 26236 entries, 1 to 99997
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      26236 non-null  int64  
 1   device_type  26236 non-null  object 
 2   age          26236 non-null  float64
 3   time         13823 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.0+ MB


In [30]:
import pandas as pd

analytics_data = pd.read_csv('DataSets/web_analytics_data.csv')

age_avg = analytics_data['age'].mean()
analytics_data['age'] = analytics_data['age'].fillna(age_avg)

desktop_data = analytics_data[analytics_data['device_type'] == 'desktop']
mobile_data =  analytics_data[analytics_data['device_type'] == 'mobile']

desktop_avg = desktop_data['time'].mean()
mobile_avg = mobile_data['time'].mean()

print(f"Average desktop time: {desktop_avg:.2f} seconds")
print(f"Average mobile time: {mobile_avg:.2f} seconds")

Average desktop time: 1741.87 seconds
Average mobile time: 41.16 seconds


In [34]:
import pandas as pd

analytics_data = pd.read_csv('DataSets/web_analytics_data.csv')

age_avg = analytics_data['age'].mean()
analytics_data['age'] = analytics_data['age'].fillna(age_avg)

# .copy() makes each subset an independent DataFrame, so the assignments
# below do not trigger SettingWithCopyWarning
desktop_data = analytics_data[analytics_data['device_type'] == 'desktop'].copy()
mobile_data = analytics_data[analytics_data['device_type'] == 'mobile'].copy()

desktop_avg = desktop_data['time'].mean()
mobile_avg = mobile_data['time'].mean()

desktop_data['time'] = desktop_data['time'].fillna(desktop_avg)
mobile_data['time'] = mobile_data['time'].fillna(mobile_avg)

# Check whether any missing values remain
desktop_data.info()
print()
mobile_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73764 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      73764 non-null  int64  
 1   device_type  73764 non-null  object 
 2   age          73764 non-null  float64
 3   time         73764 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 2.8+ MB

<class 'pandas.core.frame.DataFrame'>
Index: 26236 entries, 1 to 99997
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   user_id      26236 non-null  int64  
 1   device_type  26236 non-null  object 
 2   age          26236 non-null  float64
 3   time         26236 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 1.0+ MB


##### __Exercise__

In this activity, we'll define a DataFrame. Your work will consist of the following:

- Filter the rows for the 'North' and 'South' regions.

- Calculate the average revenue for each region.

- Use info() to verify the changes.

In [36]:
import pandas as pd

data = {
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
    'age': [25, 34, 45, None, 38, 50, None, 28, 42, 35],
    'revenue': [120, 80, 130, 95, None, None, 125, 90, None, 110]
}

sales_data = pd.DataFrame(data)

north_data = sales_data[sales_data['region'] == 'North']
south_data = sales_data[sales_data['region'] == 'South']

north_avg = north_data['revenue'].mean()
south_avg = south_data['revenue'].mean()

print(f"Average revenue in 'North': {north_avg}")
print(f"Average revenue in 'South': {south_avg}")

north_data.info()
print()
south_data.info()

Average revenue in 'North': 125.0
Average revenue in 'South': 93.75
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 8
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   user_id  5 non-null      int64  
 1   region   5 non-null      object 
 2   age      4 non-null      float64
 3   revenue  3 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 200.0+ bytes

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 9
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   user_id  5 non-null      int64  
 1   region   5 non-null      object 
 2   age      4 non-null      float64
 3   revenue  4 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 200.0+ bytes


We'll continue working on the same case as in the previous exercise. This time, using the same DataFrame, you'll use the average revenue for the 'North' region to fill in the missing values in that region, and do the same for the 'South' region. Additionally, display the averages calculated for each region.

In [None]:
import pandas as pd

data = {
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
    'age': [25, 34, 45, None, 38, 50, None, 28, 42, 35],
    'revenue': [120, 80, 130, 95, None, None, 125, 90, None, 110]
}

# Convert to a DataFrame
sales_data = pd.DataFrame(data)

# Filter the data by region; .copy() makes each subset an independent
# DataFrame, so the assignments below do not trigger SettingWithCopyWarning
north_data = sales_data[sales_data['region'] == 'North'].copy()
south_data = sales_data[sales_data['region'] == 'South'].copy()

# Calculate the average revenue per region
north_avg = north_data['revenue'].mean()
south_avg = south_data['revenue'].mean()

# Print the calculated averages
print(f"Average revenue in 'North': {north_avg}")
print(f"Average revenue in 'South': {south_avg}")

# Fill the missing values with each region's average revenue
north_data['revenue'] = north_data['revenue'].fillna(north_avg)
south_data['revenue'] = south_data['revenue'].fillna(south_avg)

# Check whether any missing values remain
north_data.info()
print()
south_data.info()
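
As an alternative sketch, the same per-region fill can be done without splitting the DataFrame, by combining groupby() with transform('mean'). This is not the approach the exercise asks for; it is shown on the same data only to illustrate that the per-group averages can be aligned row by row.

```python
import pandas as pd

data = {
    'user_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
    'age': [25, 34, 45, None, 38, 50, None, 28, 42, 35],
    'revenue': [120, 80, 130, 95, None, None, 125, 90, None, 110]
}

sales_data = pd.DataFrame(data)

# transform('mean') returns a Series aligned with the original rows,
# where each row holds the average revenue of its own region.
region_avg = sales_data.groupby('region')['revenue'].transform('mean')

# fillna() with a Series fills each missing value from the matching row.
sales_data['revenue'] = sales_data['revenue'].fillna(region_avg)

print(sales_data[['region', 'revenue']])
```

The filled values match the averages printed in the first exercise: 125.0 for 'North' and 93.75 for 'South'.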