# **Week 2 Applied Session: Introduction to Data Wrangling with Pandas**

![](https://media.licdn.com/dms/image/D5612AQEdvrs4ha4KAQ/article-cover_image-shrink_600_2000/0/1693676526923?e=2147483647&v=beta&t=aegFlNZu0P_4UKcfh4ZTol_MmcIQzqoZx5tOKKMkI1E)

Pandas is a strange name. It is derived from "panel data", an econometrics term for multidimensional structured datasets, and is also a play on "Python data analysis".

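Because pandas is an external library, you need to import it before use. You will see imports written in several ways:
- `import pandas`
- `from pandas import tools`
- `import pandas as pd`

The first makes the whole library available under the name `pandas`, so you write `pandas.DataFrame`. It is not the same as `from pandas import *`, which copies every public name straight into your namespace (the star means "all", as in SQL) and is best avoided.

The second imports only one part of pandas, here a submodule called *tools*. (That submodule existed in older pandas releases and has since been removed, so treat the line as an illustration of the syntax.)

The third is a renaming, or alias. `pd` is the common choice (you could call pandas `xyz`, but you'd be on your own).

You could leave out the `as` and just type `pandas` every time, but an alias becomes more useful for longer names, e.g. `import matplotlib.pyplot as plt`.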

So, for any code following (if the above imports work), `plt` would mean `matplotlib.pyplot`
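As a minimal sketch of how these import styles relate (assuming pandas is installed), the statements below all bind names that point at the same underlying objects:

```python
import pandas                  # full name
import pandas as pd            # alias for the same module
from pandas import DataFrame   # a single name taken from the module

# The alias and the full name refer to one and the same module object.
same_module = pd is pandas
# The directly imported class is the same object as pd.DataFrame.
same_class = DataFrame is pd.DataFrame

print(same_module, same_class)
```

Whichever style you choose, it only changes what you type, not what the library does.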

This, by the way, is a Python notebook: select a cell (this one is text, the one below is code), then press `SHIFT-ENTER` to run it. Run the cells in order.

The following code is written for Python 3. (Current pandas releases no longer support Python 2.)

In [None]:
# import libraries first
import pandas as pd
import numpy as np # Numerical Python

## **1. The Pandas DataFrame**

In [None]:
# and make one of these dataframes...
dataframe()

In [None]:
# oops, try another spelling
Dataframe()

In [None]:
# no good? Try the library
pd.dataframe()

### Errors
<font color = "green">`module` object has no attribute `dataframe`<br></font>
is better than<br>
<font color = "green">name `Dataframe` is not defined<br></font>
but neither one works...


In [None]:
# so try pandas.DataFrame()
pd.DataFrame()

So... no errors, and it seems to have worked. But what's in the DataFrame? (Nothing.)

**note**: Python is case sensitive: `DataFrame` is not the same as `Dataframe` or `dataframe`
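The three attempts above can be replayed in one cell without stopping at the first failure. This is only an illustrative sketch: it catches each exception and records its type, so you can see which kind of error each spelling produces.

```python
import pandas as pd

errors = []

try:
    dataframe()              # no such name anywhere in scope
except NameError as exc:
    errors.append(type(exc).__name__)

try:
    pd.dataframe()           # pandas exists, but has no lowercase 'dataframe'
except AttributeError as exc:
    errors.append(type(exc).__name__)

empty = pd.DataFrame()       # correct spelling: an empty DataFrame
print(errors, empty.empty)
```

A `NameError` means Python could not find the name at all. An `AttributeError` means it found the module but not the attribute, which is why the second message is "better".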

In [None]:
pd.DataFrame([2,4,6,8])

In [None]:
# aha, better but this is temporary, if you want to use the data you need to save it, so create a variable
df = pd.DataFrame([2,4,6,8])

In [None]:
# but now there's no output... can't win
# use the variable to see the data
df

**Note**: the header row shows a blank label above the index and `0` above the single data column.

And another note: Python is a zero-indexed language. We have 4 items `(2,4,6,8)`, but they are found at positions `0,1,2,3`, viz:

In [None]:
# you can get the values with its index:
df[0][1] # column 0, item 1
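The same value can also be reached in a single step with the `.loc` (label-based) and `.iloc` (position-based) indexers. A small sketch, rebuilding the same four-item DataFrame:

```python
import pandas as pd

df = pd.DataFrame([2, 4, 6, 8])

chained = df[0][1]          # select column 0, then the item labelled 1
by_label = df.loc[1, 0]     # row label 1, column label 0
by_position = df.iloc[1, 0] # second row, first column

print(chained, by_label, by_position)
```

Here labels and positions happen to coincide because both the index and the columns are the default `0, 1, 2, ...`.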

In [None]:
# name the columns axis (this labels the header row; it does not rename column 0)
df.columns.name = "Index"
df
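If the goal is to rename the column itself rather than label the header row, `DataFrame.rename` is one way to do it. A minimal sketch; the new column name `"value"` is just an example:

```python
import pandas as pd

df = pd.DataFrame([2, 4, 6, 8])
df.columns.name = "Index"                  # labels the columns axis only

# rename returns a new DataFrame; the original is left unchanged
renamed = df.rename(columns={0: "value"})

print(list(df.columns), list(renamed.columns))
```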

In [None]:
# You can also use pandas to create a series of datetime objects. Let's make one for the week beginning January 25th, 2015:
dates = pd.date_range('20150125', periods=7)

dates

Now we'll create a DataFrame using the dates array as our index, fill it with some random values using numpy, and give the columns some labels.

Note that `randn(7,5)` below matches the 7 dates (rows) and 5 names (columns)

(Otherwise it wouldn't work, try changing 5 to 6...)
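To see what "wouldn't work" looks like without breaking the notebook, the sketch below tries a 7x6 array against 5 column names and catches the resulting error. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20150125", periods=7)
names = ["Adam", "Bob", "Carla", "Dave", "Eve"]   # 5 labels

try:
    # 6 columns of data cannot be matched to 5 column labels
    pd.DataFrame(np.random.randn(7, 6), index=dates, columns=names)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(mismatch_error)
```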

In [None]:
df = pd.DataFrame(np.random.randn(7,5), index=dates, columns=['Adam','Bob','Carla','Dave','Eve'])
df

DataFrames are more flexible than that, both in terms of what you can store in them and what you can do with them.

It can also be useful to know how to create a DataFrame from a dict of objects.

This comes in particularly handy when working with JSON-like structures.

In [None]:
df2 = pd.DataFrame({ 'A' : np.random.random_sample(4), # 4 random numbers
                     'B' : pd.Timestamp('20130102'), # a single date, which pandas repeats for all 4 rows
                     'C' : pd.date_range('20150125',periods = 4), # 4 dates in a range
                     'D' : ['a','b','c','d'], # letters
                     'E' : ["cat","dog","mouse","parrot"], # text/string
                     'F' : 'copy'}) # note pandas autofills

df2
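JSON-like data often arrives as a list of records rather than a dict of columns. As a sketch with made-up records, pandas can build a DataFrame from that shape too, filling in `NaN` wherever a record lacks a key:

```python
import pandas as pd

# Hypothetical records, e.g. as parsed from a JSON response
records = [
    {"name": "Adam", "score": 1.5},
    {"name": "Bob", "score": 2.0, "city": "Melbourne"},
]

frame = pd.DataFrame(records)   # one row per record, one column per key
print(frame)
```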



---



## **2. Exploring the data in a DataFrame**

Let's import the UFO sightings dataset via URL. We can read it directly into a DataFrame as follows:

In [None]:
ufo = pd.read_csv('http://bit.ly/uforeports')

We can display the index, the columns and the underlying numpy data separately:

In [None]:
ufo.index

In [None]:
ufo.columns

We can access the data types of each column:

In [None]:
ufo.dtypes

In [None]:
ufo.values

We can get the size of the data using `shape`:

In [None]:
ufo.shape

We can get a quick statistical summary of the data using the `describe()` method:

In [None]:
ufo.describe()
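What `describe()` reports depends on the column types. The sketch below uses a small made-up DataFrame (not the UFO data) to show the default numeric summary next to `include="all"`, which also summarises text columns:

```python
import pandas as pd

sample = pd.DataFrame({
    "City": ["Ithaca", "Holyoke", "Ithaca"],   # text column
    "Count": [1, 2, 3],                        # numeric column
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # adds unique/top/freq for text

print(numeric_summary)
print(full_summary)
```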

We can look at the first five rows using the `head()` method:

In [None]:
ufo.head()

We can indicate how many rows to return by specifying an integer:

In [None]:
ufo.head(20)

We can also view the last five rows using the `tail()` method. Check the index numbers:

In [None]:
ufo.tail()

We can focus on a specific column:

In [None]:
ufo['City']

We can select a subset of rows by integer slicing:

In [None]:
ufo[1:3]

**Note**: Only the rows with index 1 and 2 are returned; the end of the slice is excluded.

We can also select rows by specific values:

In [None]:
ufo[['City','State']][ufo.State == 'NJ']

**Note**: `ufo.State == 'NJ'` is an example of using conditional indexing. We can also have other conditions like `<`, `>`, `<=`, `>=` or `!=` (not equal).

In [None]:
ufo[['City','State']][ufo.State != 'NJ']
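Conditions can also be stored in a variable, combined with `&` (and), `|` (or) and `~` (not), and passed to `.loc` together with the columns to keep. A sketch on a small made-up DataFrame that borrows the UFO column names:

```python
import pandas as pd

# Hypothetical rows in the style of the UFO data
sightings = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Trenton"],
    "State": ["NY", "NJ", "CO", "NJ"],
    "Shape Reported": ["TRIANGLE", "OTHER", "OVAL", "DISK"],
})

in_nj = sightings["State"] == "NJ"                  # boolean Series (a mask)

nj_places = sightings.loc[in_nj, ["City", "State"]]
nj_disks = sightings.loc[in_nj & (sightings["Shape Reported"] == "DISK"),
                         ["City", "State"]]
not_nj = sightings.loc[~in_nj, "City"]

print(nj_places, nj_disks, not_nj, sep="\n\n")
```

Each condition needs its own parentheses when combined, because `&` and `|` bind more tightly than `==`.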

What is the difference between the following two indexers, `loc` and `iloc`?

In [None]:
ufo.loc[:,'City':'State'].head()

In [None]:
ufo.iloc[1:6,2:6]

Enter your answer here....



---



## **Task 1: Load data and get the basic information**

In this task, you are asked to load a data file from Google Drive with your Monash account. Open the file to create a Pandas dataframe and explore it using the functions introduced above and see what information you can get from the data.

### **Connect with your Google Drive to access files**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

After you run the cell, it asks you to click on a URL and log in to grant permissions to Colab. If you followed the steps successfully, you should now see a drive folder in the left pane of this notebook, as in the figure below.

![](https://drive.google.com/uc?export=download&id=1B2sooICEr_QDLEyFHOSwIQO89LODLfAq)

If you click on it, you should be able to see the "FIT5196_S1_2025" shared drive. If you are unable to see it, let us know ASAP. If you can see it, this notebook now has access to everything on that shared drive. Let's read the `xmart` data from there.

In [None]:
xmart = pd.read_csv('/content/drive/Shareddrives/FIT5196_S1_2025/week2/xmart.csv',skiprows=1)


Now it is your turn to write Python code and find out:
1. How many records are in the dataset?
2. How many attributes are in the dataset? What are they?
3. What is the data type of each attribute?
4. Without any description provided, can you summarize what information this dataset contains?

**Note**: Don't forget to use markdown to explain your findings.



---



## **3. Editing data in DataFrame**

We can apply basic editing operations to DataFrame objects, such as updating, deleting and duplicating data, adding new columns, and inserting new rows.

### **3.1 Dealing with missing values (simplest ways)**

Continuing with the ufo data, we look for rows with missing values and, as a first approach, simply remove them.

In [None]:
ufo[10:20]

In [None]:
# check whether each value is NaN or not
ufo.isna()

In [None]:
# count the missing values in each attribute
ufo.isna().sum()

In [None]:
# list all the rows with missing values
ufo[ufo.isna().any(axis=1)]

**Note**: The ufo dataset has 18,241 rows in total, and 15,755 of them have at least one missing value. We can simply remove them all, keeping only the complete rows for the next step of the analysis. To preserve the original data, we store the complete rows in a new DataFrame.
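Counts like these can be computed rather than read off the output. The sketch below uses a four-row made-up DataFrame (not the real ufo data) to show the pattern: `isna().any(axis=1)` flags rows with at least one gap, and summing the flags counts them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the UFO data
sightings = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke", "Abilene"],
    "Colors Reported": [np.nan, "RED", np.nan, "BLUE"],
    "State": ["NY", "NJ", "CO", "KS"],
})

rows_with_gaps = int(sightings.isna().any(axis=1).sum())  # rows with >= 1 NaN
complete_rows = len(sightings.dropna())                   # rows with no NaN

print(rows_with_gaps, complete_rows, len(sightings))
```

The two counts always add up to the total number of rows, which is a quick sanity check after `dropna()`.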

In [None]:
# remove data rows with missing values
ufo1 = ufo.dropna()
ufo1.head(20)

In [None]:
ufo1.isna().sum()

In this way, only a very small part of the data is kept and we may not have enough data to work out any useful knowledge. Instead of removing all the missing values, we can replace missing parts with some specific values. For example, set them all to zero.

In [None]:
ufo2 = ufo.fillna(value=0)
ufo2.head(20)

Oops... we have nominal attributes, not numeric. Zeros do not work!!!

First, let's copy the DataFrame to have a new one to work on.

In [None]:
ufo3 = ufo.copy()
ufo3[10:20]

In [None]:
ufo3['Colors Reported'].fillna('BLUE')
ufo3[10:20]

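Why do we still have the missing values? Did nothing happen?

`fillna` returns a new Series and leaves the original untouched, so the result above was simply discarded. Let's try again and keep the result:

In [None]:
# Replace the null values in 'Colors Reported' with 'BLUE' and assign the result back.
# (fillna(..., inplace=True) on a selected column is chained assignment: recent pandas
# versions warn about it, and it may not modify the DataFrame.)
ufo3['Colors Reported'] = ufo3['Colors Reported'].fillna('BLUE')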
ufo3[10:20]
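When several columns need different fill values, `DataFrame.fillna` also accepts a dict that maps column names to replacements. A sketch on made-up data; the fill values `"No city"` and `"UNKNOWN"` are arbitrary choices:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the UFO data
sightings = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke"],
    "Colors Reported": [np.nan, "RED", np.nan],
    "State": ["NY", "NJ", "CO"],
})

# One fill value per column; columns not listed are left as they are
filled = sightings.fillna({"City": "No city", "Colors Reported": "UNKNOWN"})

print(filled)
```

As before, the original DataFrame is unchanged and the result has to be assigned to a name to be kept.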

### **3.2 Create a new column**

Let's create a new column called `place` that will hold the combined City and State names. We start by filling it with an empty string in every row. This isn't strictly necessary when using proper pandas methods, but it makes the demonstration more straightforward.

In [None]:
ufo4 = ufo.copy()
ufo4['place']=''
ufo4

**Note**: By default, the new column is always added at the end

Before we combine the city and state, we need to check whether there are missing values in these two columns. From above, we know the `City` column has 26 rows with missing values and the `State` is complete. So we need to fill the `City` column before merging.
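The reason for filling first is that string concatenation propagates missing values: if either side is `NaN`, the combined result is `NaN` too. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

city = pd.Series(["Ithaca", np.nan, "Holyoke"])
state = pd.Series(["NY", "NJ", "CO"])

unfilled = city + ", " + state                  # the NaN city swallows the whole entry
filled = city.fillna("No city") + ", " + state  # every row gets a usable string

print(unfilled.tolist())
print(filled.tolist())
```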

In [None]:
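# assign the filled column back (inplace=True on a selected column is unreliable in recent pandas)
ufo4['City'] = ufo4['City'].fillna('No city')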

In [None]:
ufo4['place'] = ufo4['City'] + ', ' + ufo4['State']
ufo4

We can use a `for` loop to achieve the same result.

Before applying any operation that makes use of the index, remember to check the valid index range to avoid errors.

In [None]:
ufo4.index

In [None]:
ufo4['address']=''
ufo4

In [None]:
# Using a for loop to create each entry in turn
# (column positions: 0 = City, 3 = State, 6 = address)

for i in ufo4.index:
    ufo4.iloc[i,6] = ufo4.iloc[i,0] + ', ' + ufo4.iloc[i,3]

In [None]:
ufo4


We have the same values in both `place` and `address`. But, which way is better?

### **3.3 Timing it**

The notebook's magic `%%timeit` runs the cell many times (it chooses the number of loops automatically) and reports timing statistics over several runs. We can use it to record the time of each approach and compare them.

In [None]:
ufo5 = ufo.copy()
ufo5['City'] = ufo5['City'].fillna('No city')
ufo5

In [None]:
ufo5['place']=''
ufo5['address']=''
ufo5.head()

In [None]:
%%timeit

ufo5['place'] = ufo5['City'] + ', ' + ufo5['State']

In [None]:
%%timeit

for i in ufo5.index:
    ufo5.iloc[i,6] = ufo5.iloc[i,0] + ', ' + ufo5.iloc[i,3]

You can run the cells above many times and they may report different times, but the `for` loop is consistently the much slower method.

**Note**: Pandas is built on numpy arrays, so do everything you can to avoid iterating over rows.
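Outside a notebook, the standard-library `timeit` module gives a comparable measurement. The sketch below uses a made-up 500-row DataFrame and two hypothetical helper functions. It checks that both approaches produce the same strings and prints how long five repetitions of each take.

```python
import timeit
import pandas as pd

# Hypothetical data: 500 rows of City/State pairs
frame = pd.DataFrame({"City": ["Ithaca", "Holyoke"] * 250,
                      "State": ["NY", "CO"] * 250})

def vectorized():
    # whole-column operation
    return list(frame["City"] + ", " + frame["State"])

def looped():
    # one row at a time, using positional access
    return [frame.iloc[i, 0] + ", " + frame.iloc[i, 1] for i in range(len(frame))]

same_result = vectorized() == looped()
vector_seconds = timeit.timeit(vectorized, number=5)
loop_seconds = timeit.timeit(looped, number=5)

print(same_result, vector_seconds, loop_seconds)
```

The exact numbers depend on the machine, so no particular ratio is claimed here; run it to see the gap on your own setup.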

### **3.4 Delete columns**

We only need one column for the merged place name, so we can drop the other.

In [None]:
ufo5.drop(columns=['address'], inplace=True)
ufo5.head()
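Without `inplace=True`, `drop` returns a new DataFrame and leaves the original as it was, which is handy when you want to keep both versions. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical frame with a duplicated place column
frame = pd.DataFrame({
    "City": ["Ithaca", "Holyoke"],
    "State": ["NY", "CO"],
    "place": ["Ithaca, NY", "Holyoke, CO"],
    "address": ["Ithaca, NY", "Holyoke, CO"],
})

trimmed = frame.drop(columns=["address"])   # new DataFrame, original untouched

print(list(frame.columns))
print(list(trimmed.columns))
```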



---



## **Task 2: Reproducing the data wrangling process**

Load the `Air Crashes` data and try to answer the following questions:
1. Give a summary of the data, including size, attributes and data types.
2. Does the dataset contain missing values? How are you going to deal with them?
3. Are there any columns that can be merged together? If so, apply the merging operation.
4. Remove the column(s) that contain duplicated information after merging.
5. Find the subset of records for air crashes with survivors.

In [None]:
aircrash = pd.read_csv('/content/drive/Shareddrives/FIT5196_S1_2025/week2/AirCrashes.csv')
aircrash.head()