Readme:


We encourage you to explore more functionality in 'Python for Data Analysis, 3E' by Wes McKinney, Chapter 7: 'Data Cleaning and Preparation'.</br>
Link: https://wesmckinney.com/book/data-cleaning

In [1]:
import pandas as pd
import numpy as np


<h3><b>Task 1 </b></h3>

<p>
1. Create a Series from the list ["aardvark", np.nan, None, "avocado"] and observe how the values 'np.nan' and 'None' are represented.  </br>
2. What is the data type of the Series? </br>
3. Then identify which values are considered 'NA' (not available) by calling the isna() method. </br>

</p>


In [4]:
import numpy as np
import pandas as pd

# 1. Create the Series
data = pd.Series(["aardvark", np.nan, None, "avocado"])
print("Original Series:")
print(data)

# 2. Data type of the Series
print("\nData type of the Series:")
print(data.dtype)

# 3. Identify NA values
print("\nMissing values (NA) using isna():")
print(data.isna())

Original Series:
0    aardvark
1         NaN
2        None
3     avocado
dtype: object

Data type of the Series:
object

Missing values (NA) using isna():
0    False
1     True
2     True
3    False
dtype: bool
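
As an optional extension of Task 1, the sketch below counts the missing values and shows that notna() is the element-wise complement of isna(). The variable name n_missing is illustrative and not part of the original exercise.

```python
import numpy as np
import pandas as pd

data = pd.Series(["aardvark", np.nan, None, "avocado"])

# isna() returns booleans, so summing them counts the missing entries
n_missing = int(data.isna().sum())
print(n_missing)

# notna() is the element-wise negation of isna()
print(data.notna().tolist())
```

Both np.nan and None are counted, since pandas treats each of them as NA.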


<h3><b>Task 2 </b></h3>
<p>
For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.</br>
Create a Series from the list [1, 2, None], specifying the data type as 'float64'. How will the 'None' value be represented here? Compare it with the representation of 'None' in the previous task. </br>
</p>


In [5]:
import pandas as pd
import numpy as np

# Create Series with float64 dtype
s = pd.Series([1, 2, None], dtype="float64")
print("Series with float64 dtype:")
print(s)

# Check which values are missing
print("\nMissing values (NA) using isna():")
print(s.isna())

# Data type of the Series
print("\nData type of the Series:")
print(s.dtype)


Series with float64 dtype:
0    1.0
1    2.0
2    NaN
dtype: float64

Missing values (NA) using isna():
0    False
1    False
2     True
dtype: bool

Data type of the Series:
float64
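
For comparison, here is a small sketch of two variations on Task 2. It assumes you also want to see what happens without an explicit dtype, and with the nullable "Int64" extension dtype (which is not part of the original task). The names inferred and nullable are illustrative.

```python
import pandas as pd

# Without an explicit dtype, pandas still infers float64 here,
# because None is converted to NaN
inferred = pd.Series([1, 2, None])
print(inferred.dtype)

# The nullable integer extension dtype keeps the integers and shows <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable)
```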


<h3><b>Task 3 </b></h3>
<p>
Create a Series from list [1, np.nan, 3.5, np.nan, 7] and drop the missing values. </br>
</p>


In [6]:
import pandas as pd
import numpy as np

# Create the Series
s = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Drop missing values
s_dropped = s.dropna()

# Print the result
print("Original Series:")
print(s)
print("\nSeries after dropping missing values:")
print(s_dropped)


Original Series:
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

Series after dropping missing values:
0    1.0
2    3.5
4    7.0
dtype: float64
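
The same result as dropna() can be obtained with boolean indexing. The sketch below is one way to check that the two approaches agree; s_filtered is an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Keep only the entries that are not missing
s_filtered = s[s.notna()]

# Compare with the dropna() result from the task
print(s_filtered.equals(s.dropna()))
```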


<h3><b>Task 4 </b></h3>
<p>
1. Create a DataFrame from the nested list [[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]].</br>
2. Call the dropna() method. Does it drop every row containing at least one missing value, or only rows where all values are missing? </br>
</p>


In [7]:
import pandas as pd
import numpy as np

# 1. Create the DataFrame
df = pd.DataFrame([
    [1., 6.5, 3.],
    [1., np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.]
])

print("Original DataFrame:")
print(df)

# 2. Drop rows with missing values
df_dropped = df.dropna()

print("\nDataFrame after dropna():")
print(df_dropped)


Original DataFrame:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

DataFrame after dropna():
     0    1    2
0  1.0  6.5  3.0


<h3><b>Task 5 </b></h3>
<p>
Rewrite the dropna() call so that it drops only the rows where ALL values are missing.</br>
</p>


In [8]:
import pandas as pd
import numpy as np

# Original DataFrame
df = pd.DataFrame([
    [1., 6.5, 3.],
    [1., np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.]
])

print("Original DataFrame:")
print(df)

# Drop only rows where all values are NaN
df_cleaned = df.dropna(how='all')

print("\nDataFrame after dropping rows where all values are missing:")
print(df_cleaned)


Original DataFrame:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

DataFrame after dropping rows where all values are missing:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
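
Between "any" and "all" there is a middle ground: the thresh argument of dropna() keeps only rows with at least that many non-missing values. The sketch below reuses the DataFrame from this task; the threshold of 2 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1., 6.5, 3.],
    [1., np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.]
])

# Keep rows that have at least 2 non-missing values
df_thresh = df.dropna(thresh=2)
print(df_thresh)
```

Row 1 has only one non-missing value and row 2 has none, so both are dropped here.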


<h3><b>Task 6 </b></h3>
<p>
1. Add a column labeled 3 where all values are NA. </br>
2. Drop the COLUMN where ALL values are NA.
</p>


In [9]:
import pandas as pd
import numpy as np

# Create the initial DataFrame
df = pd.DataFrame([
    [1., 6.5, 3.],
    [1., np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.]
])

# Step 1: Add column indexed as 3 (i.e., 4th column) where all values are NA
df[3] = np.nan

print("DataFrame with new all-NA column:")
print(df)

# Step 2: Drop the COLUMN where ALL values are NA
df_cleaned = df.dropna(axis=1, how='all')

print("\nDataFrame after dropping columns where all values are missing:")
print(df_cleaned)


DataFrame with new all-NA column:
     0    1    2   3
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

DataFrame after dropping columns where all values are missing:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


<h3><b>Task 7 </b></h3>
<p>
1. Create a DataFrame of random floats with shape (7, 3). </br>
2. Assign NA to the values at row index 0 to 3 inclusive and column index 1. </br>
3. Assign NA to the values at row index 0 to 1 inclusive and column index 2. </br>
4. Fill NA values with 0. </br>
5. Fill NA values with 1 for column 1, and fill NA values with 2 for column 2 using a dictionary. </br>

</p>


In [10]:
import pandas as pd
import numpy as np

# 1. Create a DataFrame from random floats with shape (7, 3)
df = pd.DataFrame(np.random.randn(7, 3))
print("Original DataFrame:\n", df)

# 2. Assign NA to values at row index 0 to 3 and column index 1
df.loc[0:3, 1] = np.nan

# 3. Assign NA to values at row index 0 to 1 and column index 2
df.loc[0:1, 2] = np.nan

print("\nDataFrame with NaNs:\n", df)

# 4. Fill all NA values with 0
df_filled_all_zeros = df.fillna(0)
print("\nNA filled with 0:\n", df_filled_all_zeros)

# 5. Fill NA values with 1 for column 1, and 2 for column 2 using a dictionary
df_filled_custom = df.fillna({1: 1, 2: 2})
print("\nNA filled with {1:1, 2:2}:\n", df_filled_custom)


Original DataFrame:
           0         1         2
0  1.557082  2.482382 -0.325690
1 -1.333851 -0.480262 -1.085746
2 -0.230311 -0.629006  0.982608
3 -0.691996  1.161170 -0.484433
4  1.188529  1.047122  0.202400
5 -0.812344  1.046497 -0.027518
6 -0.022648  0.063043  0.133190

DataFrame with NaNs:
           0         1         2
0  1.557082       NaN       NaN
1 -1.333851       NaN       NaN
2 -0.230311       NaN  0.982608
3 -0.691996       NaN -0.484433
4  1.188529  1.047122  0.202400
5 -0.812344  1.046497 -0.027518
6 -0.022648  0.063043  0.133190

NA filled with 0:
           0         1         2
0  1.557082  0.000000  0.000000
1 -1.333851  0.000000  0.000000
2 -0.230311  0.000000  0.982608
3 -0.691996  0.000000 -0.484433
4  1.188529  1.047122  0.202400
5 -0.812344  1.046497 -0.027518
6 -0.022648  0.063043  0.133190

NA filled with {1:1, 2:2}:
           0         1         2
0  1.557082  1.000000  2.000000
1 -1.333851  1.000000  2.000000
2 -0.230311  1.000000  0.982608
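3 -0.691996  1.000000 -0.484433
4  1.188529  1.047122  0.202400
5 -0.812344  1.046497 -0.027518
6 -0.022648  0.063043  0.133190
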
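
Besides constants, missing values can be filled from the data itself. The sketch below uses a small fixed DataFrame (a made-up example, so the results are reproducible) to show forward filling and filling each column with its mean.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the random one from the task
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, np.nan]})

# Forward fill propagates the last valid observation downward;
# a leading NaN has nothing before it and stays missing
df_ffill = df.ffill()
print(df_ffill)

# Fill each column with its own mean (NaN is skipped when computing it)
df_mean = df.fillna(df.mean())
print(df_mean)
```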

<h3><b>Task 8 </b></h3>
<p>
1. Run the code below and display the result. </br>
2. Return a Boolean Series indicating whether a row is a duplicate. </br>
3. Return a dataframe where the duplicated rows are dropped. </br>
4. Return a dataframe where rows are dropped only if we have duplicates in column k2. </br>

</p>


In [11]:
import pandas as pd

# 1. Create the DataFrame
df = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4]
})
print("Original DataFrame:\n", df)

# 2. Return a Boolean Series indicating whether a row is a duplicate
duplicates_bool = df.duplicated()
print("\nBoolean Series (duplicated rows):\n", duplicates_bool)

# 3. Return a DataFrame where the duplicated rows are dropped
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame without any duplicated rows:\n", df_no_duplicates)

# 4. Return a DataFrame where rows are dropped only if we have duplicates in column 'k2'
df_no_dup_k2 = df.drop_duplicates(subset='k2')
print("\nDataFrame without duplicates in column 'k2':\n", df_no_dup_k2)


Original DataFrame:
     k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

Boolean Series (duplicated rows):
 0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

DataFrame without any duplicated rows:
     k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

DataFrame without duplicates in column 'k2':
     k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
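
By default, duplicated() and drop_duplicates() keep the first occurrence of each value combination. The sketch below shows the keep="last" option on the same data; df_last is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "k1": ["one", "two"] * 3 + ["two"],
    "k2": [1, 1, 2, 3, 3, 4, 4]
})

# Keep the last row for each distinct value of k2 instead of the first
df_last = df.drop_duplicates(subset="k2", keep="last")
print(df_last)
```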


<h3><b>Task 9 </b></h3>
<p>
Add a new column called 'animal' to the DataFrame below by mapping the 'food' column through the meat_to_animal dictionary. </br>
</p>


In [12]:
import pandas as pd

df = pd.DataFrame({
    "food": ["bacon", "pulled pork", "bacon",
             "pastrami", "corned beef", "bacon",
             "pastrami", "honey ham", "nova lox"],
    "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]
})

meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
    "nova lox": "salmon"
}

# Map the 'food' column using meat_to_animal dictionary
df['animal'] = df['food'].map(meat_to_animal)

print(df)


          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     pastrami     6.0     cow
4  corned beef     7.5     cow
5        bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
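
One detail worth knowing when mapping with a dictionary: values that are not keys of the dictionary become NaN rather than raising an error. The sketch below uses a shortened mapping and a made-up "tofu" entry to illustrate this, and contrasts it with a function that indexes the dictionary directly.

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["bacon", "pastrami", "tofu"])  # "tofu" is a made-up entry

# With a dict, values that are not keys become NaN
animal = food.map(meat_to_animal)
print(animal)

# A function that indexes the dict raises KeyError on the unknown value
try:
    food.map(lambda x: meat_to_animal[x])
    raised = False
except KeyError:
    raised = True
print(raised)
```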


<h3><b>Task 10 </b></h3>
<p>
Tip: </br>
- 'map' works element-wise on a Series; </br>
- 'apply' works on a row / column basis of a DataFrame; </br>
- 'applymap' works element-wise on a DataFrame (in pandas 2.1 and later it is deprecated in favor of DataFrame.map); </br> </br>

We could achieve the same result by mapping the function below over the 'food' column. Run the code below and analyze the result. </br>

</p>


In [14]:
def get_animal(x):
    return meat_to_animal[x]

df['animal'] = df.food.map(get_animal)
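
The three methods from the tip can be compared side by side. The sketch below uses a tiny made-up DataFrame; the hasattr check is only there because the element-wise DataFrame method is named map in pandas 2.1 and later and applymap in older versions.

```python
import pandas as pd

df_small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Series.map: element-wise on a single column
doubled = df_small["a"].map(lambda x: x * 2)

# DataFrame.apply: the function receives a whole column (axis=0 by default)
col_range = df_small.apply(lambda col: col.max() - col.min())

# Element-wise on the whole DataFrame
if hasattr(df_small, "map"):
    squared = df_small.map(lambda x: x ** 2)
else:
    squared = df_small.applymap(lambda x: x ** 2)

print(doubled.tolist())
print(col_range.tolist())
print(squared)
```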

<h3><b>Task 11 </b></h3>
<p>
As you've already seen, 'map' can be used to modify a subset of values in an object, but 'replace' provides a simpler and more flexible way to do so. </br>
Given the Series below, replace the value -999 with 0 and the value -1000 with np.nan using the replace() method.</br>

</p>


In [16]:
import pandas as pd
import numpy as np

s = pd.Series([1., -999., 2., -999., -1000., 3.])

# Replace -999 with 0 and -1000 with np.nan
s_replaced = s.replace({-999: 0, -1000: np.nan})

print(s_replaced)


0    1.0
1    0.0
2    2.0
3    0.0
4    NaN
5    3.0
dtype: float64
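
replace() also accepts lists. The sketch below shows two list-based forms on the same Series: replacing several values with one substitute, and replacing them pairwise. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1., -999., 2., -999., -1000., 3.])

# Replace several sentinel values with a single substitute
s_both_nan = s.replace([-999, -1000], np.nan)
print(s_both_nan)

# Replace pairwise: -999 -> 0 and -1000 -> NaN
s_pairwise = s.replace([-999, -1000], [0, np.nan])
print(s_pairwise)
```

The pairwise form gives the same result as the dictionary used in the task.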


<h3><b>Task 12</b></h3>
<p>
Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects.  </br>
1. Given the DataFrame below, create and map a custom function that converts the index values to uppercase. </br>
2. Modify the DataFrame in place by assigning the new index to it. </br>

</p>


In [17]:
import pandas as pd
import numpy as np

# Create the DataFrame
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["Ohio", "Colorado", "New York"],
                  columns=["one", "two", "three", "four"])

# 1. Create a custom function to capitalize index values
def capitalize_index(name):
    return name.upper()

# 2. Modify the DataFrame in place
df.index = df.index.map(capitalize_index)

print(df)


          one  two  three  four
OHIO        0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11


<h3><b>Task 13 </b></h3>
<p>
If you want to create a transformed version of a dataset without modifying the original, a useful method is 'rename'. </br>
Create a transformed version of the above DataFrame by using the rename() method to convert all column names to uppercase. </br>

</p>


In [18]:
import pandas as pd
import numpy as np

# Original DataFrame
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["Ohio", "Colorado", "New York"],
                  columns=["one", "two", "three", "four"])

# Create a transformed version with capitalized column names
df_transformed = df.rename(columns=str.upper)

print(df_transformed)


          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
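
To confirm that rename() leaves the original object untouched, here is a short check. The replacement label "INDIANA" is an arbitrary example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["Ohio", "Colorado", "New York"],
                  columns=["one", "two", "three", "four"])

# rename returns a new object with the changed labels
renamed = df.rename(index={"Ohio": "INDIANA"})

print(list(renamed.index))
print(list(df.index))  # the original index is unchanged
```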


<h3><b>Task 14 </b></h3>
<p>
Now rename the above DataFrame so that the index label "COLORADO" becomes "FOO" and the column "two" becomes "bar". </br>
</p>


In [20]:
import pandas as pd
import numpy as np

# Original DataFrame
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["Ohio", "Colorado", "New York"],
                  columns=["one", "two", "three", "four"])

# Uppercase the index (as in Task 12); the column names are already
# lowercase, so the second line leaves them unchanged
df.index = df.index.str.upper()
df.columns = df.columns.str.lower()

# Rename specific index and column
df_renamed = df.rename(index={"COLORADO": "FOO"}, columns={"two": "bar"})

print(df_renamed)


          one  bar  three  four
OHIO        0    1      2     3
FOO         4    5      6     7
NEW YORK    8    9     10    11


<h3><b>Task 15 </b></h3>
Regular Expressions.</br>
Run the code below and analyze the result.

In [21]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']


<h3><b>Task 16</b></h3>
When you call re.split(r"\s+", text), the regular expression is first compiled, and then its split method is called on the passed text. </br>
1. Now compile the regex yourself with re.compile, forming a reusable regex object.</br>
2. Apply the compiled regex object to the 'text' string.</br>
3. Now get a list of all substrings matching the compiled regex, using the findall method.</br></br>

*Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.</br>
*'match' and 'search' are closely related to 'findall'. While 'findall' returns all matches in a string, 'search' returns only the first match. More strictly, 'match' only matches at the beginning of the string. 

In [22]:
import re

# Original text
text = "foo    bar\t baz  \tqux"

# 1. Compile the regex
pattern = re.compile(r"\s+")

# 2. Apply the compiled regex to split the text
split_result = pattern.split(text)
print("Split Result:", split_result)

# 3. Get all whitespace patterns using findall
matches = pattern.findall(text)
print("Whitespace Matches:", matches)


Split Result: ['foo', 'bar', 'baz', 'qux']
Whitespace Matches: ['    ', '\t ', '  \t']
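
The note on 'match' and 'search' can be checked on the same text. The sketch below reuses the whitespace pattern; first is an illustrative name.

```python
import re

text = "foo    bar\t baz  \tqux"
pattern = re.compile(r"\s+")

# search returns a match object for the first whitespace run only
first = pattern.search(text)
print(first.span())

# match looks only at the start of the string, which is "f" here,
# so it returns None
print(pattern.match(text))

# findall returns every whitespace run
print(len(pattern.findall(text)))
```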


<h3><b>Task 17 </b></h3>
<p>
String Functions in pandas.</br>
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values!</br>
To cope with this, Series has array-oriented methods for string operations that skip over and propagate NA values. </br>
These are accessed through Series’s 'str' attribute.</br>
For example, we could check whether each email address has "gmail" in it with str.contains() as shown below. </br>
</p>


In [23]:
import pandas as pd
import numpy as np

# Sample data with some email addresses and a missing value
data = {
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": np.nan
}

# Create a Series from the dictionary
s = pd.Series(data)

# Use the .str.contains() method to check if 'gmail' is in each string
result = s.str.contains('gmail')

print(result)


Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
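
To see why the 'str' methods are needed, the sketch below tries the plain map approach on a shortened version of the data. It assumes the missing value reaches the lambda as a float NaN, which is what raises the TypeError; the na_action="ignore" argument of Series.map is shown as one possible workaround.

```python
import numpy as np
import pandas as pd

s = pd.Series({"Dave": "dave@google.com", "Wes": np.nan})

# "gmail" in NaN is not a valid operation, so plain map fails
try:
    s.map(lambda x: "gmail" in x)
    failed = False
except TypeError:
    failed = True
print(failed)

# na_action="ignore" skips NA values and propagates them to the result
safe = s.map(lambda x: "gmail" in x, na_action="ignore")
print(safe)
```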


<h3><b>Task 18 </b></h3>
<p>
Note that the result of this operation has an object dtype. </br>
pandas has extension types that provide for specialized treatment of strings, integers, and Boolean data.</br>
Run the code below and pay attention to the dtype.</br>
These 'string' arrays generally use much less memory and are frequently computationally more efficient for doing operations on large datasets. </br>
</p>


In [25]:
import pandas as pd
import numpy as np

# Original data with one missing value
data = {
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": np.nan
}

# Create Series
s = pd.Series(data)

# Convert to string extension type
data_as_string_ext = s.astype('string')

# Display the result and dtype
print(data_as_string_ext)
print("\nDtype:", data_as_string_ext.dtype)


Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                 <NA>
dtype: string

Dtype: string
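
The extension type also changes the result of string methods. As a sketch, repeating the str.contains() check from Task 17 on the 'string' Series (with a shortened version of the data) should give a nullable 'boolean' result instead of 'object'.

```python
import numpy as np
import pandas as pd

s = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Wes": np.nan
})

# Same check as in Task 17, but on the string extension type
result = s.astype("string").str.contains("gmail")
print(result)
print(result.dtype)
```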


<h3><b>Task 19 </b></h3>
<p>
Regular expressions can be used, too, along with any re options like IGNORECASE. </br>
Analyze the pattern below, run the code and pay attention to the syntax. </br>

</p>


In [26]:
import pandas as pd
import numpy as np
import re

# Sample data
data = {
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": np.nan
}

s = pd.Series(data)

# Regex pattern to extract parts of an email address
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# Apply the pattern using findall with IGNORECASE
matches = s.str.findall(pattern, flags=re.IGNORECASE)

# Display results
print(matches)


Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object


<h3><b>Task 20 </b></h3>
<p>
There are a couple of ways to do vectorized element retrieval. Either use str.get() or index into the 'str' attribute (for example, str[0]).</br>
Run the code below and analyze the result. </br>
</p>


In [27]:
matches = s.str.findall(pattern, flags=re.IGNORECASE).str[0].str.get(1)
print(matches)
print(s.str[:5])

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object
Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object


<h3><b>Task 21 </b></h3>
<p>
The str.extract() method will return the captured groups of a regular expression as a DataFrame.</br>
Run the code below and analyze the result. </br>
</p>


In [28]:
s.str.extract(pattern, flags=re.IGNORECASE)

           0       1    2
Dave    dave  google  com
Steve  steve   gmail  com
Rob      rob   gmail  com
Wes      NaN     NaN  NaN
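
If the regular expression uses named groups, str.extract() uses the group names as column labels. The sketch below adds the names username, domain and suffix to the email pattern; these names are our own choice for illustration.

```python
import re

import numpy as np
import pandas as pd

s = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": np.nan
})

# Same email pattern as before, with a name attached to each group
named_pattern = (r"(?P<username>[A-Z0-9._%+-]+)"
                 r"@(?P<domain>[A-Z0-9.-]+)"
                 r"\.(?P<suffix>[A-Z]{2,4})")

parts = s.str.extract(named_pattern, flags=re.IGNORECASE)
print(parts)
```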
