# **Week 2 Applied Session: Introduction to Data Wrangling with Pandas**

![](https://media.licdn.com/dms/image/D5612AQEdvrs4ha4KAQ/article-cover_image-shrink_600_2000/0/1693676526923?e=2147483647&v=beta&t=aegFlNZu0P_4UKcfh4ZTol_MmcIQzqoZx5tOKKMkI1E)

# Pandas is a strange name: it comes from "panel data" and doubles as a play on "Python data analysis"

Because pandas is an external library, you need to import it. You will see imports written in several ways:
- `import pandas`
- `from pandas import tools`
- `import pandas as pd`

The first imports the whole library under its full name, so every call must be written `pandas.something`. It is not the same as `from pandas import *`, which copies all public names into your own namespace (the star means "all", as in SQL) and is best avoided.

The second imports only one part of a library, here a submodule called *tools*. (Treat this line as an illustration of the syntax: recent pandas releases no longer ship a `pandas.tools` module.)

The third is a renaming, or alias. `pd` is the common choice (you could call pandas `xyz`, but you'd be on your own).

You could leave out the `as` part and type `pandas` every time, but an alias becomes more useful for longer names, e.g. `import matplotlib.pyplot as plt`.

After that import, `plt` means `matplotlib.pyplot` in any code that follows.
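As a minimal sketch of the import styles described above, the snippet below checks that the alias and the full name point at the same module. It imports `DataFrame` rather than `tools`; that substitution is an assumption made only so the example runs on current pandas.

```python
import pandas                 # full name: call things as pandas.DataFrame
import pandas as pd           # alias: call the same things as pd.DataFrame
from pandas import DataFrame  # import a single name directly

# All three routes lead to the same module and the same class
same_module = pd is pandas
same_class = DataFrame is pd.DataFrame is pandas.DataFrame
print(same_module, same_class)
```

Importing the same library twice does no harm: Python loads it once and reuses it.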

This, by the way, is a Python notebook. Select a cell (this one is text, the one below is code), then press `SHIFT-ENTER` to run the cells in order.

The code below is written for Python 3, which current pandas releases require.

In [2]:
# import libraries first
import pandas as pd
import numpy as np # Numerical Python

## **1. The Pandas DataFrame**

In [3]:
# and make one of these dataframes...
dataframe()

NameError: name 'dataframe' is not defined

In [4]:
# oops, try another spelling
Dataframe()

NameError: name 'Dataframe' is not defined

In [5]:
# no good? Try the library
pd.dataframe()

AttributeError: module 'pandas' has no attribute 'dataframe'

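### Errors
<font color = "green">`module` object has no attribute `dataframe`<br></font>
is better than<br>
<font color = "green">name `Dataframe` is not defined<br></font>
because the first shows that Python found the pandas module and only the attribute name is wrong. Neither call works yet, though...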


In [6]:
# so try pandas.DataFrame()
pd.DataFrame()

So... no errors; it seems to have worked. But what's in the DataFrame? (Nothing: it is empty.)

**note**: Python is case sensitive: `DataFrame` is not the same as `Dataframe` or `dataframe`
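The failed attempts above can be replayed safely with `try`/`except`. This is a small sketch; the variable names `name_error`, `attribute_error` and `empty` are made up for the illustration.

```python
import pandas as pd

# A bare misspelled name is unknown to Python itself
try:
    dataframe()
except NameError as err:
    name_error = str(err)
    print("NameError:", name_error)

# pandas is found, but it has no attribute with this spelling
try:
    pd.dataframe()
except AttributeError as err:
    attribute_error = str(err)
    print("AttributeError:", attribute_error)

# The correctly capitalised class works and yields an empty DataFrame
empty = pd.DataFrame()
print(empty.empty, empty.shape)
```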

In [7]:
pd.DataFrame([2,4,6,8])

Unnamed: 0,0
0,2
1,4
2,6
3,8


In [8]:
# aha, better, but this result is temporary; to reuse the data, store it in a variable
df = pd.DataFrame([2,4,6,8])

In [9]:
# Creating a DataFrame with 3 rows and 4 columns.
# Row indices are set to [4, 5, 6], and column names are ['a', 'b', 'c', 'd'].
df2 = pd.DataFrame([[1,2,3,4],[2,4,6,8],[1,2,3,4]], index=[4,5,6], columns=['a','b','c','d'])
df2

Unnamed: 0,a,b,c,d
4,1,2,3,4
5,2,4,6,8
6,1,2,3,4


In [10]:
# the same constructor with positional arguments: data, index, columns
df3 = pd.DataFrame([[2,4],[1,2]],[6,8],['a','b'])
df3

Unnamed: 0,a,b
6,2,4
8,1,2


In [11]:
# but now there's no output... can't win
# use the variable to see the data
df

Unnamed: 0,0
0,2
1,4
2,6
3,8


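**Note**: the header row shows a blank title and `0`. The blank one belongs to the index (the row labels); the only real column is named `0`.

And another note: Python is a zero-indexed language. We have 4 items `(2,4,6,8)`, but they are found at positions `0,1,2,3`, viz: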

In [12]:
# you can get the values with its index:
df[0][1] # column 0, item 1

np.int64(4)
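The chained lookup `df[0][1]` reads column `0` first and then the row labelled `1`. As a sketch of the alternatives, the same cell can be reached in one step with `loc` (labels) or `iloc` (positions); the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([2, 4, 6, 8])

chained = df[0][1]       # column 0 first, then the row labelled 1
by_label = df.loc[1, 0]  # row label 1, column label 0
by_pos = df.iloc[1, 0]   # row position 1, column position 0
print(chained, by_label, by_pos)
```

Note the order: `loc` and `iloc` take the row first and the column second, the reverse of the chained form.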

In [13]:
# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['NY', 'LA', 'Chicago', 'Houston']
})

# The DataFrame looks like this:
#     Name   Age     City
# 0  Alice    25       NY
# 1    Bob    30       LA
# 2 Charlie    35  Chicago
# 3  David    40   Houston
# Print the DataFrame, then its column names
print(df)
print("---------------------------------")
print("Column names:")
print(df.columns.names)   # Names of the column index levels ([None] when unnamed)
print(df.columns.tolist())   # Converts to list for cleaner output
# Rename all columns (must match the number of columns exactly)
df.columns = ['Full Name', 'Years', 'Location']

print("\nDataFrame after renaming columns:")
print("Column names:")
print(df.columns.names)   # Names of the column index levels, still [None]
print(df.columns.tolist())
print(df)
print("---------------------------------")

# Rename specific columns using a dictionary
df.rename(columns={'Years': 'Age'}, inplace=True)

print("\nAfter renaming one column (Years ➝ Age):")


print(df)
print("---------------------------------")
# Add a new column called 'Country' with the same value for all rows
df['Country'] = 'USA'

print("\nAfter adding new column 'Country':")
print(df)
print("-----------------------------------")
# Add a column based on existing data
df['Age in 5 Years'] = df['Age'] + 5

print("\nAfter adding 'Age in 5 Years' column:")
print(df)
print("-----------------------------------")


      Name  Age     City
0    Alice   25       NY
1      Bob   30       LA
2  Charlie   35  Chicago
3    David   40  Houston
---------------------------------
Column names:
[None]
['Name', 'Age', 'City']

DataFrame after renaming columns:
Column names:
[None]
['Full Name', 'Years', 'Location']
  Full Name  Years Location
0     Alice     25       NY
1       Bob     30       LA
2   Charlie     35  Chicago
3     David     40  Houston
---------------------------------

After renaming one column (Years ➝ Age):
  Full Name  Age Location
0     Alice   25       NY
1       Bob   30       LA
2   Charlie   35  Chicago
3     David   40  Houston
---------------------------------

After adding new column 'Country':
  Full Name  Age Location Country
0     Alice   25       NY     USA
1       Bob   30       LA     USA
2   Charlie   35  Chicago     USA
3     David   40  Houston     USA
-----------------------------------

After adding 'Age in 5 Years' column:
  Full Name  Age Location Country  Age in 5 

In [11]:


print(df)
print("--------------------------------------")
# 1. Accessing a single row by position
row_1 = df.iloc[1]
# Returns the row at position 1 (the second row): Bob's data
print("Accessing a single row by position df.iloc[1]  ",row_1)
print("without column headers then,df.iloc[1].values ",df.iloc[1].values)
# .values strips the labels and leaves a NumPy array of the row's values
print("--------------------------------------")

# 2. Accessing a specific cell by row and column position
cell = df.iloc[2, 1]
# Returns 35 → Row 2 (Charlie), Column 1 (Age)
print("Accessing a specific cell by row and column position df.iloc[2, 1] ",cell)
print("--------------------------------------")

# 3. Accessing multiple rows by index position
rows = df.iloc[[0, 2]]
# Returns rows 0 and 2 (Alice and Charlie)
print("Accessing multiple rows by index position df.iloc[[0, 2]]  ",rows)
print("--------------------------------------")

# 4. Slicing rows (just like Python lists)
slice_rows = df.iloc[1:3]
# Returns rows 1 and 2 (Bob and Charlie)
print("Slicing rows (just like Python lists) df.iloc[1:3]  ",slice_rows)
print("--------------------------------------")

# 5. Accessing a range of rows and specific columns
subset = df.iloc[0:3, 0:2]
# Returns first 3 rows and first 2 columns (Full Name and Age)
print("Accessing a range of rows and specific columns df.iloc[0:3, 0:2]  ",subset)
print("--------------------------------------")

# 6. Accessing a whole column by position (as a Series)
column_1 = df.iloc[:, 1]
# Returns the 'Age' column
print("Accessing a whole column by position (as a Series) df.iloc[:, 1]  ",column_1)
print("--------------------------------------")

# 7. Accessing multiple specific columns by position
multi_cols = df.iloc[:, [0, 2]]
# Returns only 'Full Name' and 'Location' columns for all rows
print("Accessing multiple specific columns by position df.iloc[:, [0, 2]]  ",multi_cols)
print("--------------------------------------")

# 8. Last row using negative indexing
last_row = df.iloc[-1]
# Returns the last row (David)
print("Last row using negative indexing df.iloc[-1]  ",last_row)
print("--------------------------------------")

# 9. Last 2 rows and last 2 columns
bottom_right = df.iloc[-2:, -2:]
print("Last 2 rows and last 2 columns df.iloc[-2:, -2:] ",bottom_right)
print("--------------------------------------")

# Returns the last 2 rows of the last 2 columns:
#    Country  Age in 5 Years
# 2      USA              40
# 3      USA              45




  Full Name  Age Location Country  Age in 5 Years
0     Alice   25       NY     USA              30
1       Bob   30       LA     USA              35
2   Charlie   35  Chicago     USA              40
3     David   40  Houston     USA              45
--------------------------------------
Accessing a single row by position df.iloc[1]   Full Name         Bob
Age                30
Location           LA
Country           USA
Age in 5 Years     35
Name: 1, dtype: object
without column headers then,df.iloc[1].values  ['Bob' np.int64(30) 'LA' 'USA' np.int64(35)]
--------------------------------------
Accessing a specific cell by row and column position df.iloc[2, 1]  35
--------------------------------------
Accessing multiple rows by index position df.iloc[[0, 2]]     Full Name  Age Location Country  Age in 5 Years
0     Alice   25       NY     USA              30
2   Charlie   35  Chicago     USA              40
--------------------------------------
Slicing rows (just like Python lists) df

In [12]:
# name the column index (this labels the header row; no column is renamed)
print(df.columns.name)
df.columns.name = "Index"
df

None


Index,Full Name,Age,Location,Country,Age in 5 Years
0,Alice,25,NY,USA,30
1,Bob,30,LA,USA,35
2,Charlie,35,Chicago,USA,40
3,David,40,Houston,USA,45


###### **Generate a sequence of datetime values (i.e., a range of dates) with a specified frequency.**


* Syntax: `pd.date_range(start=None, end=None, periods=None, freq=None, ...)`
* start: Start date (e.g., '2015-01-25' or '20150125')
* end: End date
* periods: Number of periods (i.e., number of dates)
* freq: Frequency string (e.g., 'D' for daily, 'ME' for month end; pandas versions before 2.2 spell month end as 'M')



In [13]:
# You can also use pandas to create a sequence of dates (a DatetimeIndex). Let's make one for the week beginning January 25th, 2015:
dates = pd.date_range('20150125', periods=7)

dates

DatetimeIndex(['2015-01-25', '2015-01-26', '2015-01-27', '2015-01-28',
               '2015-01-29', '2015-01-30', '2015-01-31'],
              dtype='datetime64[ns]', freq='D')

In [39]:
print("Monthly dates (end of each month):")
print(pd.date_range('2022-01-01', periods=5, freq='ME'))
print("Weekly dates (Sundays by default):")
print(pd.date_range('2022-01-01', periods=5, freq='W'))
print("Hourly timestamps starting from midnight:")
print(pd.date_range('2022-01-01', periods=5, freq='h'))

Monthly dates (end of each month):
DatetimeIndex(['2022-01-31', '2022-02-28', '2022-03-31', '2022-04-30',
               '2022-05-31'],
              dtype='datetime64[ns]', freq='ME')
Weekly dates (Sundays by default):
DatetimeIndex(['2022-01-02', '2022-01-09', '2022-01-16', '2022-01-23',
               '2022-01-30'],
              dtype='datetime64[ns]', freq='W-SUN')
Hourly timestamps starting from midnight:
DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 01:00:00',
               '2022-01-01 02:00:00', '2022-01-01 03:00:00',
               '2022-01-01 04:00:00'],
              dtype='datetime64[ns]', freq='h')
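The `end` parameter and anchored weekly frequencies are not used in the cells above. The following sketch shows both; the variable names are chosen only for this illustration.

```python
import pandas as pd

# Give start and end instead of periods: both endpoints are included
week = pd.date_range(start='2015-01-25', end='2015-01-31')

# Anchor weekly dates on Mondays instead of the default Sundays
mondays = pd.date_range('2022-01-01', periods=3, freq='W-MON')

print(len(week))
print(mondays)
```

Since 1 January 2022 was a Saturday, the first Monday in the second range is 3 January 2022.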


Now we'll create a DataFrame using the dates array as our index, fill it with some random values using numpy, and give the columns some labels.

Note that `randn(7,5)` below matches the 7 dates (rows) and 5 names (columns).

(Otherwise it wouldn't work: try changing 5 to 6 and pandas raises a `ValueError` about mismatched shapes.)

### `np.random.randn(7, 5)`

- Comes from **NumPy** (`np`)
- Generates a **7×5 matrix** of random numbers drawn from a **standard normal distribution**  
  *(mean = 0, standard deviation = 1)*
- So you're getting **random values in 7 rows and 5 columns**
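To see the shape requirement in action, here is a hedged sketch. It uses NumPy's newer `default_rng` generator with a fixed seed instead of `np.random.randn`; that swap is an assumption made so the numbers repeat on every run.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20150125', periods=7)
names = ['Adam', 'Bob', 'Carla', 'Dave', 'Eve']

# Assumption: a seeded generator, so the "random" numbers repeat on every run
rng = np.random.default_rng(seed=0)

# 7 rows x 5 columns matches 7 dates x 5 names
ok = pd.DataFrame(rng.standard_normal((7, 5)), index=dates, columns=names)
print(ok.shape)

# 7 x 6 values cannot be labelled with only 5 names
mismatch_error = None
try:
    pd.DataFrame(rng.standard_normal((7, 6)), index=dates, columns=names)
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```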


In [14]:
df = pd.DataFrame(np.random.randn(7,5), index=dates, columns=['Adam','Bob','Carla','Dave','Eve'])
df

Unnamed: 0,Adam,Bob,Carla,Dave,Eve
2015-01-25,0.989682,0.958731,0.49819,-2.014645,-1.646304
2015-01-26,0.985056,0.277313,-1.27163,0.021645,-0.6522
2015-01-27,1.003401,0.08156,-1.599394,-0.321854,-0.585767
2015-01-28,2.513243,0.001368,-0.378693,0.388481,-0.627812
2015-01-29,-0.167215,-1.0568,-3.065853,1.100188,0.248513
2015-01-30,-0.204601,2.071436,0.29953,-0.149921,1.885903
2015-01-31,-1.059301,-0.312548,-0.299291,-0.695659,1.478951


DataFrames are more flexible than that, both in terms of what you can store in them and what you can do with them.

It can also be useful to know how to create a DataFrame from a dict of objects.

This comes in particularly handy when working with JSON-like structures.

In [15]:
df2 = pd.DataFrame({ 'A' : np.random.random_sample(4), # 4 random numbers
                     'B' : pd.Timestamp('20130102'), # one timestamp, note pandas autofills it into all 4 rows
                     'C' : pd.date_range('20150125',periods = 4), # 4 dates in a range
                     'D' : ['a','b','c','d'], # letters
                     'E' : ["cat","dog","mouse","parrot"], # text/string
                     'F' : 'copy'}) # note pandas autofills

df2

Unnamed: 0,A,B,C,D,E,F
0,0.392002,2013-01-02,2015-01-25,a,cat,copy
1,0.214405,2013-01-02,2015-01-26,b,dog,copy
2,0.398435,2013-01-02,2015-01-27,c,mouse,copy
3,0.093292,2013-01-02,2015-01-28,d,parrot,copy
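For JSON-like data, which usually arrives as a list of records rather than a dict of columns, something along these lines may help. The records are invented for the example; `pd.json_normalize` is the pandas helper for flattening nested keys.

```python
import pandas as pd

# Hypothetical records, shaped like parsed JSON
records = [
    {'name': 'cat', 'legs': 4, 'owner': {'first': 'Ann'}},
    {'name': 'parrot', 'legs': 2, 'owner': {'first': 'Bo'}},
]

# One row per dict; the nested dict stays as a single object column
flat = pd.DataFrame(records)

# json_normalize expands nested keys into dotted column names
expanded = pd.json_normalize(records)

print(flat.columns.tolist())
print(expanded.columns.tolist())
```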




---



## **2. Exploring the data in a DataFrame**

Let's import the UFO sightings dataset from a URL with `read_csv`:

In [16]:
ufo = pd.read_csv('http://bit.ly/uforeports')

We can display the index, the columns and the underlying NumPy data separately:

In [17]:
ufo.index

RangeIndex(start=0, stop=18241, step=1)

In [18]:
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

We can access the data types of each column:

In [19]:
ufo.dtypes

Unnamed: 0,0
City,object
Colors Reported,object
Shape Reported,object
State,object
Time,object


In [20]:
ufo.values

array([['Ithaca', nan, 'TRIANGLE', 'NY', '6/1/1930 22:00'],
       ['Willingboro', nan, 'OTHER', 'NJ', '6/30/1930 20:00'],
       ['Holyoke', nan, 'OVAL', 'CO', '2/15/1931 14:00'],
       ...,
       ['Eagle River', nan, nan, 'WI', '12/31/2000 23:45'],
       ['Eagle River', 'RED', 'LIGHT', 'WI', '12/31/2000 23:45'],
       ['Ybor', nan, 'OVAL', 'FL', '12/31/2000 23:59']], dtype=object)

We can get the size of the data using `shape`:

In [21]:
ufo.shape

(18241, 5)

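We can get a quick statistical summary of the data using the `describe()` function. Because every column here holds text, the summary reports `count`, `unique`, `top` (the most frequent value) and `freq` (how often it occurs) rather than means and quartiles: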

In [22]:
ufo.describe()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
count,18215,2882,15597,18241,18241
unique,6475,27,27,52,16145
top,Seattle,RED,LIGHT,CA,11/16/1999 19:00
freq,187,780,2803,2529,27
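The four rows of that summary are easier to read on a column small enough to check by hand. A minimal sketch with made-up data:

```python
import pandas as pd

# Tiny hypothetical column: 4 rows, one of them missing
states = pd.DataFrame({'State': ['CA', 'CA', 'NY', None]})

summary = states.describe()
print(summary)
```

Here `count` is 3 because the missing value is ignored, `unique` is 2, `top` is `CA` and `freq` is 2. The same reading explains why `count` differs between columns in the ufo summary: the columns with lower counts contain missing values.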


We can have a look at the first five rows using the `head()` function:

In [23]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


We can indicate how many rows to return by specifying an integer:

In [24]:
ufo.head(20)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
5,Valley City,,DISK,ND,9/15/1934 15:30
6,Crater Lake,,CIRCLE,CA,6/15/1935 0:00
7,Alma,,DISK,MI,7/15/1936 0:00
8,Eklutna,,CIGAR,AK,10/15/1936 17:00
9,Hubbard,,CYLINDER,OR,6/15/1937 0:00


We can also get the last five rows using the `tail()` function; check the index numbers:

In [25]:
ufo.tail()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45
18240,Ybor,,OVAL,FL,12/31/2000 23:59


We can focus on a specific column:

In [26]:
ufo['City']

Unnamed: 0,City
0,Ithaca
1,Willingboro
2,Holyoke
3,Abilene
4,New York Worlds Fair
...,...
18236,Grant Park
18237,Spirit Lake
18238,Eagle River
18239,Eagle River


We can select a subset of rows by slicing on integer positions:

In [27]:
ufo[1:3]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00


**Note**: Only the rows with index 1 and 2 are returned; as with Python lists, the end of the slice (3) is excluded.

We can also select rows by specific values:

In [28]:
ufo[['City','State']][ufo.State == 'NJ']

Unnamed: 0,City,State
1,Willingboro,NJ
141,Newark,NJ
147,Sandy Hook,NJ
255,Keansburg,NJ
311,Red Bank,NJ
...,...,...
17961,Freehold,NJ
17976,East Brunswick,NJ
18006,West Allenhurst,NJ
18023,Elizabeth,NJ


**Note**: `ufo.State == 'NJ'` is an example of using conditional indexing. We can also have other conditions like `<`, `>`, `<=`, `>=` or `!=` (not equal).

In [29]:
ufo[['City','State']][ufo.State != 'NJ']

Unnamed: 0,City,State
0,Ithaca,NY
2,Holyoke,CO
3,Abilene,KS
4,New York Worlds Fair,NY
5,Valley City,ND
...,...,...
18236,Grant Park,IL
18237,Spirit Lake,IA
18238,Eagle River,WI
18239,Eagle River,WI


What is the difference between the following two indexers, `loc` and `iloc`?

In [30]:
ufo.loc[:,'City':'State'].head()

Unnamed: 0,City,Colors Reported,Shape Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS
4,New York Worlds Fair,,LIGHT,NY


In [31]:
ufo.iloc[1:6,2:6]

Unnamed: 0,Shape Reported,State,Time
1,OTHER,NJ,6/30/1930 20:00
2,OVAL,CO,2/15/1931 14:00
3,DISK,KS,6/1/1931 13:00
4,LIGHT,NY,4/18/1933 19:00
5,DISK,ND,9/15/1934 15:30


Enter your answer here....
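One way to probe the question is a small frame whose row labels differ from its row positions. This is a sketch on invented data, not the only possible answer:

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
    index=[10, 20, 30],
)

# loc slices by label and includes both endpoints
by_label = df.loc[10:20, 'a':'b']

# iloc slices by position and excludes the end, like a Python list
by_position = df.iloc[0:2, 0:2]

# iloc slices that run past the edge are silently clipped
clipped = df.iloc[1:6, 2:6]

print(by_label)
print(by_position)
print(clipped)
```

The clipping in the last call is also why `ufo.iloc[1:6,2:6]` above returns only three columns: the table has no columns at positions 5 and beyond.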



---



## **Task 1: Load data and get the basic information**

In this task, you are asked to load a data file from Google Drive with your Monash account. Open the file to create a Pandas dataframe and explore it using the functions introduced above and see what information you can get from the data.

### **Connect with your Google Drive to access files**

In [98]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


After you run the cell, it asks you to click on a URL and log in to give permissions to Colab. If you followed the steps successfully, you should now see a drive folder in the left pane of this notebook, as in the figure below.

![](https://drive.google.com/uc?export=download&id=1B2sooICEr_QDLEyFHOSwIQO89LODLfAq)

If you click on it, you should be able to see the "FIT5196" shared drive. If you are unable to see it, let us know ASAP. If you can see it, this notebook now has access to everything on that shared drive. Let's read the `xmart` data from there.

In [14]:
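# On Colab, read from the mounted drive:
# xmart = pd.read_csv('/content/drive/MyDrive/FIT5196/Week 2/xmart.csv',skiprows=1)
# This run used a local copy instead; skiprows=1 skips the first line of the file
xmart = pd.read_csv('C:/Personal Stuff/Monash/Sem3/FIT5196DataWrangling/Labs/xmart.csv',skiprows=1)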

Now it is your turn to write Python code and find out:
1. How many records are in the dataset?
2. How many attributes are in the dataset? What are they?
3. What is the data type of each attribute?
4. Without any description provided, can you summarize what information this dataset contains?

**Note**: Don't forget to use markdown to explain your findings.

In [42]:
print("1. Number of records (rows):", xmart.shape[0])


1. Number of records (rows): 940


## Accessing Column Names and Values in Pandas

You can easily work with column names in a Pandas DataFrame using `.columns`, and access column data using indexing.

---

###  Example DataFrame

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
})
```
## Summary Table: `.columns` vs `.values`

| Expression                 | Description                                   | Returns                          |
|---------------------------|-----------------------------------------------|----------------------------------|
| `df.columns`              | All column names (a pandas `Index`, not a list) | `Index(['Name', 'Age', 'City'])` |
| `df.columns[0]`           | First column name                             | `'Name'`                         |
| `df[df.columns[0]]`       | Data from the first column                    | Pandas Series                    |
| `df[df.columns[0]].values` | Values from the first column (as NumPy array) | `['Alice', 'Bob', 'Charlie']`    |
| `df.values`               | Entire DataFrame values (no labels)           | NumPy array (2D)                 |

---

### ⚠️ What Does `df.values` Do?

`df.values` returns the **entire dataset** as a **NumPy array**, without index or column names.

📌 It is useful when you want raw data (e.g., for machine learning models)
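The pandas documentation recommends `to_numpy()` over `.values` for the same job. A short sketch on the example frame (the names `table` and `ages` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
})

# Whole frame as a 2D array: labels are dropped, mixed columns share one dtype
table = df.to_numpy()

# One numeric column as a 1D array keeps its integer dtype
ages = df['Age'].to_numpy()

print(table.shape)
print(ages)
```

For numeric work it is usually better to convert only the numeric columns, because a frame that mixes text and numbers becomes one array with a single generic dtype.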


In [47]:
print("\n2. Number of attributes (columns):", xmart.shape[1])
print("   Attribute names:", list(xmart.columns.values))


2. Number of attributes (columns): 4
   Attribute names: ['Country', 'Data Source', 'Beverage Types', ' 2012']
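Notice the leading space in `' 2012'`: as written, the column has to be addressed as `xmart[' 2012']`. One hedged way to tidy such headers, shown on a two-row stand-in for the real file:

```python
import pandas as pd

# Hypothetical two-row stand-in for xmart, including the ' 2012' header
xmart_like = pd.DataFrame({
    'Country': ['Afghanistan', 'Afghanistan'],
    ' 2012': [0.0, 0.01],
})

# Strip surrounding whitespace from every column name
xmart_like.columns = xmart_like.columns.str.strip()

print(xmart_like.columns.tolist())
print(xmart_like['2012'].sum())
```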


##  Understanding Data Types in Pandas

Pandas builds on **NumPy dtypes** and adds a few of its own (such as `category`) to represent data types in DataFrames. Here's a summary of common data types, how they appear, and examples.

---

###  Common Pandas Data Types

| Pandas dtype | Description                          | Python Type           | Example Values               |
|--------------|--------------------------------------|------------------------|------------------------------|
| `object`     | Text (string) or mixed types         | `str` or mixed         | `"Apple"`, `"A123"`, `"NY"` |
| `int64`      | Integer numbers                      | `int`                  | `1`, `0`, `-100`, `9999`     |
| `float64`    | Floating point (decimal) numbers     | `float`                | `3.14`, `0.0`, `-1.23`       |
| `bool`       | Boolean values                       | `bool`                 | `True`, `False`              |
| `datetime64[ns]` | Dates and timestamps             | `datetime`             | `2022-01-01`, `2023-12-25`   |
| `category`   | Categorical data (fixed values)      | `category`             | `"Small"`, `"Medium"`, `"Large"` |

---

###  Examples in a DataFrame

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cherry'],       # object (string)
    'Price': [1.2, 0.8, 2.5],                       # float64
    'Quantity': [10, 20, 15],                       # int64
    'InStock': [True, True, False],                 # bool
    'RestockDate': pd.to_datetime(['2023-08-01', '2023-08-05', '2023-08-10']),  # datetime64[ns]
    'Size': pd.Categorical(['Small', 'Large', 'Medium'])  # category
})

print(df.dtypes)
```


In [48]:
# 3. What is the data type of each attribute?
print("\n3. Data types of each attribute:")
print(xmart.dtypes)


3. Data types of each attribute:
Country            object
Data Source        object
Beverage Types     object
 2012             float64
dtype: object


In [49]:
# 4. Summarize what information is in the dataset
print("\n4. Data sample (first 5 rows):")
print(xmart.head())


4. Data sample (first 5 rows):
       Country   Data Source              Beverage Types   2012
0  Afghanistan   Data source                        Wine   0.00
1  Afghanistan   Data source                        Beer   0.00
2  Afghanistan   Data source                     Spirits   0.00
3  Afghanistan   Data source                   All types   0.01
4  Afghanistan   Data source   Other alcoholic beverages   0.01


Here are the different values you can use for the `include` parameter in pandas `describe()`:

```python
# Include all columns (numeric, object, datetime, etc.)
print(df.describe(include='all'))

# Include only numeric columns (default behavior)
print(df.describe(include='number'))

# Include only object (text) columns; use include='category' for categorical ones
print(df.describe(include='object'))

# Include only datetime columns
print(df.describe(include='datetime'))

# Include specific data types (needs `import numpy as np`)
print(df.describe(include=[np.number]))  # numeric types
print(df.describe(include=[object]))  # object types (np.object was removed in NumPy 1.24)
print(df.describe(include=[np.datetime64]))  # datetime types

# Include multiple specific types
print(df.describe(include=['object', 'number']))
print(df.describe(include=[np.number, object]))

# Include by specific numpy dtypes
print(df.describe(include=[np.float64, np.int64, object]))
```

In [52]:
print("\n   Summary of dataset:")
print(xmart.describe(include='all'))


   Summary of dataset:
            Country   Data Source Beverage Types        2012
count           940           940            940  940.000000
unique          188             1              5         NaN
top     Afghanistan   Data source           Wine         NaN
freq              5           940            188         NaN
mean            NaN           NaN            NaN    1.985394
std             NaN           NaN            NaN    2.726358
min             NaN           NaN            NaN    0.000000
25%             NaN           NaN            NaN    0.060000
50%             NaN           NaN            NaN    0.740000
75%             NaN           NaN            NaN    3.000000
max             NaN           NaN            NaN   16.960000




---



## **3. Editing data in DataFrame**

We can apply basic editing operations to DataFrame objects, such as updating, deleting, duplicating, adding new columns and inserting new rows.

### **3.1 Dealing with missing values (simplest ways)**

Continuing with the ufo data, we look for the rows that have missing values, such as a missing city, and simply remove them.

In [53]:
ufo[10:20]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10,Fontana,,LIGHT,CA,8/15/1937 21:00
11,Waterloo,,FIREBALL,AL,6/1/1939 20:00
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
13,Keokuk,,OVAL,IA,7/7/1939 2:00
14,Ludington,,DISK,MI,6/1/1941 13:00
15,Forest Home,,CIRCLE,CA,7/2/1941 11:30
16,Los Angeles,,,CA,2/25/1942 0:00
17,Hapeville,,,GA,6/1/1942 22:30
18,Oneida,,RECTANGLE,TN,7/15/1942 1:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00


In [54]:
# check whether each value is NaN or not
ufo.isna()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,False,True,False,False,False
1,False,True,False,False,False
2,False,True,False,False,False
3,False,True,False,False,False
4,False,True,False,False,False
...,...,...,...,...,...
18236,False,True,False,False,False
18237,False,True,False,False,False
18238,False,True,True,False,False
18239,False,False,False,False,False


In [55]:
# find out how many missing values there are in each attribute
ufo.isna().sum()

Unnamed: 0,0
City,26
Colors Reported,15359
Shape Reported,2644
State,0
Time,0


<table style="background-color:#e6f2ff; padding: 10px; border: 1px solid #b3d1ff; border-radius: 5px;">
<tr><td>

<h3> Filtering Rows with Missing Values (NaNs) in Pandas</h3>

You can use the expression <code>ufo[ufo.isna().any(axis=1)]</code> to filter rows that contain <strong>any missing values</strong>.

---

<h4> Explanation</h4>

This code filters the <code>ufo</code> DataFrame to return only the rows where <strong>at least one column</strong> has a missing value (<code>NaN</code>):

<pre><code>ufo[ufo.isna().any(axis=1)]
</code></pre>

<h4> To Check for Columns with Missing Values</h4>

If you want to find out <strong>which columns</strong> have any missing values, change <code>axis</code> to <code>0</code>:

<pre><code>ufo.isna().any(axis=0)
</code></pre>

</td></tr>
</table>


In [61]:
# list all the rows with missing values
ufo[ufo.isna().any(axis=1)]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
...,...,...,...,...,...
18235,Fountain Hills,,,AZ,12/31/2000 23:00
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45


In [62]:
ufo.isna().any(axis=0)



Unnamed: 0,0
City,True
Colors Reported,True
Shape Reported,True
State,False
Time,False


**Note**: The ufo dataset has 18,241 rows in total, but 15,755 of them have at least one missing value. We can simply remove all of them and keep only the rows with complete data for the next step of the analysis. To preserve the original data, we create a new DataFrame to store the subset of complete rows.

In [63]:
# remove data rows with missing values
ufo1 = ufo.dropna()
ufo1.head(20)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00
36,Portsmouth,RED,FORMATION,VA,7/10/1945 1:30
44,Blairsden,GREEN,SPHERE,CA,6/30/1946 19:00
82,San Jose,BLUE,CHEVRON,CA,7/15/1947 21:00
84,Modesto,BLUE,DISK,CA,8/8/1947 22:00
91,Scipio,RED,SPHERE,IN,5/10/1948 19:00
111,Tarrant City,ORANGE,CIRCLE,AL,8/15/1949 22:00
129,Napa,GREEN,DISK,CA,6/10/1950 0:00
138,Coeur d'Alene,ORANGE,CIGAR,ID,7/2/1950 13:00


In [64]:
ufo1.isna().sum()

Unnamed: 0,0
City,0
Colors Reported,0
Shape Reported,0
State,0
Time,0
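A plain `dropna()` is the bluntest option. As a sketch of gentler variants, `subset` and `thresh` restrict which missing values count; the four-row table below is invented for the illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical miniature of the ufo table
sightings = pd.DataFrame({
    'City': ['Ithaca', np.nan, 'Belton', 'Holyoke'],
    'Colors Reported': [np.nan, np.nan, 'RED', np.nan],
    'State': ['NY', 'NJ', 'SC', 'CO'],
})

# Plain dropna(): a row is dropped if any column is missing
complete = sightings.dropna()

# subset: only a missing City disqualifies a row
has_city = sightings.dropna(subset=['City'])

# thresh: keep rows with at least 2 non-missing values
mostly_complete = sightings.dropna(thresh=2)

print(len(complete), len(has_city), len(mostly_complete))
```

On the real data, `subset=['City']` would drop only the 26 rows without a city instead of every incomplete row.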


Dropping every incomplete row this way keeps only a very small part of the data (2,486 of 18,241 rows), and we may not have enough left to work out any useful knowledge. Instead of removing the rows with missing values, we can replace the missing parts with specific values. For example, set them all to zero.

In [65]:
ufo2 = ufo.fillna(value=0)
ufo2.head(20)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,0,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,0,OTHER,NJ,6/30/1930 20:00
2,Holyoke,0,OVAL,CO,2/15/1931 14:00
3,Abilene,0,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,0,LIGHT,NY,4/18/1933 19:00
5,Valley City,0,DISK,ND,9/15/1934 15:30
6,Crater Lake,0,CIRCLE,CA,6/15/1935 0:00
7,Alma,0,DISK,MI,7/15/1936 0:00
8,Eklutna,0,CIGAR,AK,10/15/1936 17:00
9,Hubbard,0,CYLINDER,OR,6/15/1937 0:00


Oops... we have nominal attributes, not numeric. Zeros do not work!!!

First, let's copy the DataFrame to have a new one to work on.

In [66]:
ufo3 = ufo.copy()
ufo3[10:20]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10,Fontana,,LIGHT,CA,8/15/1937 21:00
11,Waterloo,,FIREBALL,AL,6/1/1939 20:00
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
13,Keokuk,,OVAL,IA,7/7/1939 2:00
14,Ludington,,DISK,MI,6/1/1941 13:00
15,Forest Home,,CIRCLE,CA,7/2/1941 11:30
16,Los Angeles,,,CA,2/25/1942 0:00
17,Hapeville,,,GA,6/1/1942 22:30
18,Oneida,,RECTANGLE,TN,7/15/1942 1:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00


In [67]:
ufo3['Colors Reported'].fillna('BLUE')
ufo3[10:20]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10,Fontana,,LIGHT,CA,8/15/1937 21:00
11,Waterloo,,FIREBALL,AL,6/1/1939 20:00
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
13,Keokuk,,OVAL,IA,7/7/1939 2:00
14,Ludington,,DISK,MI,6/1/1941 13:00
15,Forest Home,,CIRCLE,CA,7/2/1941 11:30
16,Los Angeles,,,CA,2/25/1942 0:00
17,Hapeville,,,GA,6/1/1942 22:30
18,Oneida,,RECTANGLE,TN,7/15/1942 1:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00


Why do we still have the missing values? Nothing happened?

`fillna()` returns a new object and leaves the original untouched unless the result is stored. Let's try it again:

In [68]:
# Change the null values with 'BLUE' in 'Colors Reported'
ufo3['Colors Reported'].fillna('BLUE', inplace=True)
ufo3[10:20]

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  ufo3['Colors Reported'].fillna('BLUE', inplace=True)


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
10,Fontana,BLUE,LIGHT,CA,8/15/1937 21:00
11,Waterloo,BLUE,FIREBALL,AL,6/1/1939 20:00
12,Belton,RED,SPHERE,SC,6/30/1939 20:00
13,Keokuk,BLUE,OVAL,IA,7/7/1939 2:00
14,Ludington,BLUE,DISK,MI,6/1/1941 13:00
15,Forest Home,BLUE,CIRCLE,CA,7/2/1941 11:30
16,Los Angeles,BLUE,,CA,2/25/1942 0:00
17,Hapeville,BLUE,,GA,6/1/1942 22:30
18,Oneida,BLUE,RECTANGLE,TN,7/15/1942 1:00
19,Bering Sea,RED,OTHER,AK,4/30/1943 23:00
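The warning printed above suggests two spellings that keep working in pandas 3.0. Here is a minimal sketch of both on a made-up three-row column:

```python
import pandas as pd
import numpy as np

# Hypothetical miniature of the 'Colors Reported' column
colors = pd.DataFrame({'Colors Reported': ['RED', np.nan, np.nan]})

# fillna returns a new Series; without assignment the frame is unchanged
colors['Colors Reported'].fillna('BLUE')
still_missing = int(colors['Colors Reported'].isna().sum())

# Option 1: assign the filled column back
option1 = colors.copy()
option1['Colors Reported'] = option1['Colors Reported'].fillna('BLUE')

# Option 2: call fillna on the frame with a {column: value} mapping
option2 = colors.fillna({'Colors Reported': 'BLUE'})

print(still_missing)
print(option1['Colors Reported'].tolist())
```

Whether `'BLUE'` is a sensible replacement is a separate question: a neutral label such as `'UNKNOWN'` would avoid inventing colours that were never reported.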


### **3.2 Create a new column**

Let's create a new column called `place` that will hold the combined City and State names, starting with an empty string in every row. This step isn't strictly necessary when using proper Pandas methods, but it makes the demonstration more straightforward.

In [69]:
ufo4 = ufo.copy() # remember to use copy()
ufo4['place']=''
ufo4

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,place
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,
2,Holyoke,,OVAL,CO,2/15/1931 14:00,
3,Abilene,,DISK,KS,6/1/1931 13:00,
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,
...,...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00,
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00,
18238,Eagle River,,,WI,12/31/2000 23:45,
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45,


**Note**: By default, a new column is added at the end.

Before we combine the city and state, we need to check whether there are missing values in these two columns. From above, we know the `City` column has 26 rows with missing values and the `State` column is complete. Concatenating a string with a missing value gives a missing value, so we need to fill the `City` column before merging.
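
To see why the fill matters, here is a small sketch on a made-up two-row frame (`demo` is hypothetical, not the UFO data). A missing city makes the whole combined value missing unless it is filled first:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing city
demo = pd.DataFrame({'City': ['Ithaca', np.nan], 'State': ['NY', 'NJ']})

# Combining directly: the missing city propagates into the result
naive = demo['City'] + ', ' + demo['State']

# Filling first keeps every row usable
safe = demo['City'].fillna('No city') + ', ' + demo['State']

print(naive.isna().tolist())
print(safe.tolist())
```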

In [70]:
# Assign the result back to the column. The chained form
# ufo4['City'].fillna('No city', inplace=True) emits a FutureWarning
# and will stop working in pandas 3.0.
ufo4['City'] = ufo4['City'].fillna('No city')
# equivalent alternative:
# ufo4.fillna({'City': 'No city'}, inplace=True)


In [71]:
ufo4['place'] = ufo4['City'] + ', ' + ufo4['State']
ufo4

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,place
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO"
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS"
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY"
...,...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00,"Grant Park, IL"
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00,"Spirit Lake, IA"
18238,Eagle River,,,WI,12/31/2000 23:45,"Eagle River, WI"
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45,"Eagle River, WI"
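
As an aside, the `+` operator is not the only way to combine text columns. The sketch below uses `Series.str.cat` on a made-up frame (`demo` is hypothetical) and also shows that the target column does not need to exist beforehand:

```python
import pandas as pd

# Hypothetical two-row frame; no 'place' column yet
demo = pd.DataFrame({'City': ['Ithaca', 'Willingboro'], 'State': ['NY', 'NJ']})

# str.cat joins two string Series element by element with a separator
demo['place'] = demo['City'].str.cat(demo['State'], sep=', ')

print(demo['place'].tolist())
```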


We can use a `for` loop to achieve the same result.

Before we apply any operations that make use of the index, remember to check the valid index range to avoid errors.

In [72]:
ufo4.index

RangeIndex(start=0, stop=18241, step=1)

In [73]:
ufo4['address']=''
ufo4

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,place,address
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY",
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ",
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO",
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS",
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY",
...,...,...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00,"Grant Park, IL",
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00,"Spirit Lake, IA",
18238,Eagle River,,,WI,12/31/2000 23:45,"Eagle River, WI",
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45,"Eagle River, WI",


In [78]:
ufo4.iloc[:, 6].name # 7th column name

'address'
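
Hard-coded positions such as `6` break silently if the column order changes. One possible way to avoid that, sketched here on a made-up frame (`demo` is hypothetical), is to look the positions up by name with `columns.get_loc`. Looping over `range(len(demo))` also keeps `iloc` correct when the index is not a simple 0-based range:

```python
import pandas as pd

# Hypothetical frame with an empty target column
demo = pd.DataFrame({
    'City': ['Ithaca', 'Holyoke'],
    'State': ['NY', 'CO'],
    'address': ['', ''],
})

# Look up integer positions by column name instead of hard-coding them
city_pos = demo.columns.get_loc('City')
state_pos = demo.columns.get_loc('State')
address_pos = demo.columns.get_loc('address')

# iloc takes positions, so iterate over positions
for i in range(len(demo)):
    demo.iloc[i, address_pos] = demo.iloc[i, city_pos] + ', ' + demo.iloc[i, state_pos]

print(demo['address'].tolist())
```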

In [74]:
# Using a for loop to create each entry in turn

for i in ufo4.index:
    ufo4.iloc[i,6] = ufo4.iloc[i,0] + ', ' + ufo4.iloc[i,3]

In [75]:
ufo4


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,place,address
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY","Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ","Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO","Holyoke, CO"
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS","Abilene, KS"
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY","New York Worlds Fair, NY"
...,...,...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00,"Grant Park, IL","Grant Park, IL"
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00,"Spirit Lake, IA","Spirit Lake, IA"
18238,Eagle River,,,WI,12/31/2000 23:45,"Eagle River, WI","Eagle River, WI"
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45,"Eagle River, WI","Eagle River, WI"


We have the same values in both `place` and `address`. But, which way is better?

### **3.3 Timing it**

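The notebook's cell magic `%%timeit` runs the cell many times and reports the mean and standard deviation of the run time. It chooses the number of loops automatically, as the `7 runs, 100 loops each` output below shows. We can use it to time both approaches and compare them.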

In [79]:
ufo5 = ufo.copy()
ufo5['City'] = ufo5['City'].fillna('No city')  # assign back instead of chained inplace=True
ufo5


Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00
...,...,...,...,...,...
18236,Grant Park,,TRIANGLE,IL,12/31/2000 23:00
18237,Spirit Lake,,DISK,IA,12/31/2000 23:00
18238,Eagle River,,,WI,12/31/2000 23:45
18239,Eagle River,RED,LIGHT,WI,12/31/2000 23:45


In [80]:
ufo5['place']=''
ufo5['address']=''
ufo5.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,place,address
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,,
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,,
2,Holyoke,,OVAL,CO,2/15/1931 14:00,,
3,Abilene,,DISK,KS,6/1/1931 13:00,,
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,,


In [81]:
%%timeit

ufo5['place'] = ufo5['City'] + ', ' + ufo5['State']

3.49 ms ± 818 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [82]:
%%timeit

for i in ufo5.index:
    ufo5.iloc[i,6] = ufo5.iloc[i,0] + ', ' + ufo5.iloc[i,3]

3.2 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Timings vary from run to run, but the `for` loop is consistently far slower: about 3.2 s against about 3.5 ms above, roughly 900 times slower.

**Note**: Pandas is built on NumPy arrays, so do everything you can to avoid iterating over rows.
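
Outside a notebook, where `%%timeit` is not available, a rough comparison can be made with `time.perf_counter`. This is only a sketch on synthetic data (`demo`, `n` and the repeated values are made up), and a single timed run is much noisier than `%%timeit`:

```python
import time
import pandas as pd

n = 2000  # kept small so the loop finishes quickly
demo = pd.DataFrame({'City': ['Ithaca'] * n, 'State': ['NY'] * n})
demo['place'] = ''
demo['address'] = ''

# Vectorised: one operation over whole columns
start = time.perf_counter()
demo['place'] = demo['City'] + ', ' + demo['State']
vectorised_seconds = time.perf_counter() - start

# Row by row: one iloc read/write per cell
address_pos = demo.columns.get_loc('address')
start = time.perf_counter()
for i in range(n):
    demo.iloc[i, address_pos] = demo.iloc[i, 0] + ', ' + demo.iloc[i, 1]
loop_seconds = time.perf_counter() - start

print(f'vectorised: {vectorised_seconds:.4f} s, loop: {loop_seconds:.4f} s')
```

Both columns end up with identical values; only the time taken differs.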

### **3.4 Delete columns**

We only need to keep one column for the merged place, so we can drop the other.

In [83]:
ufo5.drop(columns=['address'], inplace=True)
ufo5.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,place
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO"
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS"
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY"
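
A short sketch of how `drop` behaves without `inplace=True`, on a made-up one-row frame (`demo` is hypothetical). By default it returns a new DataFrame and leaves the original untouched, and `errors='ignore'` makes the call safe to repeat:

```python
import pandas as pd

# Hypothetical frame with a redundant column
demo = pd.DataFrame({
    'City': ['Ithaca'],
    'place': ['Ithaca, NY'],
    'address': ['Ithaca, NY'],
})

# Default: returns a new frame, demo still has 'address'
trimmed = demo.drop(columns=['address'])

# errors='ignore' skips columns that are already gone instead of raising KeyError
trimmed_again = trimmed.drop(columns=['address'], errors='ignore')

print(demo.columns.tolist())
print(trimmed.columns.tolist())
```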




---



## **Task 2: Reproducing the data wrangling process**

Load the `Air Crashes` dataset, and try to answer the following questions:
1. Give a summary of the data, including size, attributes and data types.
2. Does the dataset contain missing values? How are you going to deal with them?
3. Are there any columns that can be merged together? Apply the merging operation.
4. Remove the column(s) that contain duplicated information after merging.
5. Find the subset of records for air crashes with survivors.

In [None]:
aircrash = pd.read_csv('C:/Personal Stuff/Monash/Sem3/FIT5196DataWrangling/Labs/AirCrashes.csv')
aircrash.head()

# Missing values are country and/or location. Lat/long data are still present
# Lat/long can be merged to coordinates. Country/location can be merged together with/without coordinates. 
# For completeness, get a library to fill in country/location using lat/long coordinates


Unnamed: 0,Flight,Mode,Casualties,Circumstances,Country,Crew,Date,Ground,Latitude,Location,Longitude,Notes,Passengers,Phase,Reason,Total Dead,Type
0,Aeroflot Flight 217,Ilyushin Il-62,Extremely High,Bad Visibility by Day,Russia,10,13/10/1972,0,55.755826,Moscow - Russia,37.6173,No survivors,164,APR,Accident,174,COM
1,Aeroflot Flight 3352,Tupolev Tu-154,Extremely High,Bad Visibility by Night,Russia,5,11/10/1984,4,54.966667,Omsk - Russia,73.383333,Some survivors,169,LDG,Accident,178,COM
2,Aeroflot Flight 4227,Tupolev Tu-154B-2,Extremely High,Bad Visibility by Night,Kazakhstan,10,1890-07-08,0,43.255058,Almaty - Kazakhstan,76.912628,No survivors,156,ENR,Accident,166,COM
3,Aeroflot Flight 7425,,Extremely High,Bad Visibility by Night,,9,10/7/1985,0,42.156667,Uchkuduk - Uzbekistan,63.555556,No survivors,191,ENR,Accident,0,COM
4,Aeroflot/Moldovia (CCCP-65816),Tupolev Tu-134A(both),Extremely High,Bad Visibility by Night,Kazakhstan,13,11/8/1979,0,48.8125,,46.763611,No survivors,165,ENR,Accident,178,COM


In [25]:
air_crashes = aircrash.copy()
air_crashes.shape

(58, 17)
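
For question 1, one possible way to gather size, attributes, data types and missing-value counts is sketched below. The frame `demo` is a made-up miniature with a few column names borrowed from the sample rows above; on the real data the same calls would be applied to `air_crashes`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the air-crash data
demo = pd.DataFrame({
    'Flight': ['Aeroflot Flight 217', 'Aeroflot Flight 7425'],
    'Country': ['Russia', np.nan],
    'Crew': [10, 9],
    'Latitude': [55.755826, 42.156667],
})

n_rows, n_cols = demo.shape       # size
dtypes = demo.dtypes              # data type of each attribute
missing = demo.isna().sum()       # missing values per column
cols_with_missing = missing[missing > 0].index.tolist()

print(n_rows, n_cols)
print(dtypes)
print(cols_with_missing)
```

`demo.info()` prints a similar overview in a single call.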

## 5. Find the subset of records for air crashes with survivors.

---


##### Method 1: Using `.str.contains()`

```python
survivors = air_crashes[air_crashes['Notes'].str.contains("survivors", case=False, na=False)]
```

###### What it does:

- Looks for the text **"survivors"** in the `'Notes'` column  
- `case=False` → **Case-insensitive** (matches `"Survivors"`, `"SURVIVORS"`, etc.)  
- `na=False` → Treats `NaN` as `False` so they don’t cause errors  
- Returns a **filtered DataFrame** with only rows where `'Notes'` mention **survivors**. Note that this also matches `"No survivors"` and misses `"One survivor"`, so on its own it does not isolate crashes with survivors

##### Method 2: Using `.apply()` with `lambda`

```python
survivors = air_crashes[air_crashes['Notes'].apply(lambda x: 'survivors' in str(x).lower())]
```

###### What it does:

- Applies a **custom function** to each row in the `'Notes'` column  
- Converts each value to a string using `str(x)` and applies `.lower()` for **case-insensitive** matching  
- Checks if **"survivors"** is a **substring** in the note  
- Returns the **same result** as `.str.contains()`, but offers **more flexibility** for advanced logic (e.g., multiple keywords, conditional matching)
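
Simple substring tests are sensitive to the exact wording of `Notes`. The sketch below shows one possible, stricter filter, assuming the notes only take forms like those visible in the sample rows ("No survivors", "Some survivors", "One survivor"). The frame `demo` is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical notes, modelled on the values visible in the sample rows
demo = pd.DataFrame({
    'Notes': ['No survivors', 'Some survivors', 'One survivor', np.nan],
})

notes = demo['Notes']

# Matches "survivor" or "survivors" as a whole word
mentions_survivor = notes.str.contains(r'\bsurvivors?\b', case=False, na=False)

# Matches the explicit negative wording
says_none = notes.str.contains(r'\bno survivors?\b', case=False, na=False)

# Keep rows that mention survivors but do not say there were none
with_survivors = demo[mentions_survivor & ~says_none]

print(with_survivors['Notes'].tolist())
```

Unlike a bare `'no' in ...` test, the word-boundary pattern would not be triggered by an unrelated word that happens to contain "no", and missing notes are excluded rather than counted as survivors.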




In [23]:
survivors = air_crashes[air_crashes['Notes'].apply(lambda x: 'no' not in str(x).lower())]
survivors.head()

Unnamed: 0,Flight,Mode,Casualties,Circumstances,Country,Crew,Date,Ground,Latitude,Location,Longitude,Notes,Passengers,Phase,Reason,Total Dead,Type
1,Aeroflot Flight 3352,Tupolev Tu-154,Extremely High,Bad Visibility by Night,Russia,5,11/10/1984,4,54.966667,Omsk - Russia,73.383333,Some survivors,169,LDG,Accident,178,COM
5,African Air (RA-26222),Antonov An-32B,Extremely High,Bad Visibility by Night,,0,8/1/1996,237,-4.331667,Kinshasa - DR Congo,15.313889,Some survivors,0,ICL,Accident,37,COM
12,All Nippon Airways Flight 58 and,Boeing 727-281 and F-86 Sabre,Extremely High,Bad Visibility by Night,Japan,7,30/7/1971,0,39.696287,Shizukuishi - Japan,140.975776,One survivor,155,ENR,Accident,162,COM
17,American Airlines Flight 965,Boeing Tu-154M,Extremely High,Bad Visibility by Night,,8,20/12/1995,0,3.9,Buga - Colombia,-76.3,Some survivors,151,APR,Accident,159,COM
19,Avianca Flight 011,Boeing 747-283B,Extremely High,Bad Visibility by Night,Spain,181,27/11/1983,0,40.416775,Madrid - Spain,-3.70379,Some survivors,162,APR,Accident,181,COM


In [24]:
survivors.shape # how many crashes had survivors?

(12, 17)

In [93]:
air_crashes['Coordinates'] = air_crashes['Latitude'].astype(str) + ', ' + air_crashes['Longitude'].astype(str) # merge latitude and longitude into one text column

In [94]:
air_crashes_merged_lat_long = air_crashes.drop(columns=['Latitude', 'Longitude']) # not inplace


In [95]:
air_crashes.drop(columns=['Latitude', 'Longitude'], inplace=True) # or inplace

In [101]:
air_crashes.isna().sum() # missing values per column

Unnamed: 0,0
Flight,0
Mode,9
Casualties,3
Circumstances,5
Country,8
Crew,0
Date,0
Ground,0
Latitude,0
Location,6


In [None]:
# Fill categorical text fields with default values (open question: is this appropriate?)
air_crashes.fillna({
    'Mode': 'Unknown',
    'Casualties': 'Unknown',
    'Circumstances': 'Not reported',
    'Country': 'Unknown',
    'Location': 'Unknown',
    'Type': 'Unknown'
}, inplace=True)

In [102]:
air_crashes[['Crew', 'Passengers', 'Total Dead']].describe()


Unnamed: 0,Crew,Passengers,Total Dead
count,58.0,58.0,58.0
mean,15.396552,200.724138,257.155172
std,22.838825,87.867283,232.558071
min,0.0,0.0,0.0
25%,9.0,159.0,171.75
50%,12.0,181.0,195.0
75%,15.0,241.5,263.25
max,181.0,560.0,1692.0
