# Essential Guide to Python Pandas

The Pandas library has emerged as one of the most popular data processing tools for data professionals and software developers. It enables users to quickly perform data manipulation tasks such as handling missing data, merging multiple datasets and removing duplicate records. Many data professionals use it daily, and it is considered an essential part of the data science toolkit. This course is for aspiring data professionals and Python developers who want to learn how to process data in Pandas. To get the most out of this course, you will need a basic working knowledge of Python and should be comfortable running notebooks in Jupyter Notebook.

## Table of Contents

0. [ Jupyter Notebook Commands and Shortcuts](#section_0)
1. [ How to Import Pandas Library](#section_1)
2. [ Anatomy of Pandas Data Structures](#section_2)
3. [ Get Data into and from Pandas](#section_3)
    * [Python Native Data Structures](#section_3_1)
    * [Tabular Data Files](#section_3_2)
    * [API Query and JSON Format](#section_3_3)
    * [Web Pages Data](#section_3_4)
4. [Describe Information in DataFrames](#section_4)
5. [Understand Data Types](#)
6. [Data Cleaning in Pandas](#)
    * [Split & Merge Columns](#)
    * [Change Columns DataType](#)
    * [Rename Columns](#)
    * [Drop Rows and Columns](#)
    * [Manipulate text content](#)
7. [Pandas Merging & Joining Data](#)
8. [Data Summarization & Aggregation](#)
    * [Select Data by Column, Row, Index & Conditions](#)
    * [Group & Sort data](#)
9. [Pandas Data Visualization](#)
10. [Pandas Styling Settings](#)
11. [Pandas Analysis Project](#)
    * [Collect Data From Multiple Sources](#)
    * [Clean Data](#)
    * [Join DataFrames](#)
    * [Perform Basic Analysis](#)

### 0. Jupyter Notebook Commands and Shortcuts <a class="anchor" id="section_0"></a>

Jupyter Notebooks have two different keyboard input modes:

- Edit mode - the mode for typing in a cell, indicated by a green cell border.
- Command mode - binds the keyboard to notebook-level actions, indicated by a grey cell border with a blue left margin.

Some of the most commonly used shortcuts include:

**Command Mode**

* shift + enter run cell, select below
* ctrl + enter run cell
* option + enter run cell, insert below
* A insert cell above
* B insert cell below
* C copy cell
* V paste cell
* D , D delete selected cell
* shift + M merge selected cells, or current cell with cell below if only one cell selected
* I , I interrupt kernel
* 0 , 0 restart kernel (with dialog)
* Y change cell to code mode
* M change cell to markdown mode (good for documentation)

**Edit Mode**

* cmd + click for multi-cursor editing
* option + scrolling click for column editing
* cmd + / toggle comment lines
* tab code completion or indent
* shift + tab tooltip
* ctrl + shift + - split cell

**Command Palette** 

* cmd + shift + p open the command palette, which gives quick access to all the commands in Jupyter Notebooks

### 1. How to Import Pandas Library <a class="anchor" id="section_1"></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Install pandas (only needed if it is not already installed)
!pip install pandas



In [3]:
# Check Pandas Version
pd.__version__

'1.2.3'

### 2. Anatomy of Pandas Data Structures <a class="anchor" id="section_2"></a>

The two main Pandas data structures are **DataFrames** and **Series**. A Pandas `DataFrame` is a two-dimensional labelled structure that holds data in rows and columns, similar to a spreadsheet or a relational database table. Each DataFrame column is a Pandas `Series`: a one-dimensional labelled structure with a descriptive name and a single data type that applies to all values in that column. ***In other words, you can think of a DataFrame as a collection of Series.***
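
The relationship between the two structures can be checked directly. The following is a minimal sketch, assuming a small two-row table built from values that appear in the countries example later in this guide; the variable names are illustrative only.

```python
import pandas as pd

# Small illustrative table; the values come from the countries example in this guide
countries = pd.DataFrame(
    {'capital_city': ['Beijing', 'Wellington'],
     'population': [1433783686, 4783063]},
    index=['CN', 'NZ'])

# Selecting a single column returns a Series named after that column
capital = countries['capital_city']
print(type(capital).__name__)                  # Series
print(capital.name)                            # capital_city

# The Series shares the row labels (index) of its parent DataFrame
print(capital.index.equals(countries.index))   # True

# Each column carries its own data type
print(countries.dtypes)
```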

### 3. Get Data into and from Pandas <a class="anchor" id="section_3"></a>

The Pandas library is designed to read data from a wide variety of sources and formats. Popular data sources include tabular files, database tables, third-party APIs and Python's native data structures. This flexibility is what makes Pandas useful to many user groups, such as developers and data professionals.
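
As a rough illustration of moving data both "from" and "into" Pandas, the sketch below writes a small DataFrame to CSV text and reads it back. It uses an in-memory buffer instead of a file on disk, which is an assumption made here to keep the example self-contained.

```python
import io
import pandas as pd

countries = pd.DataFrame({'country_name': ['China', 'New Zealand'],
                          'capital_city': ['Beijing', 'Wellington']})

# "From" Pandas: serialize to CSV text (index=False leaves out the row labels)
csv_text = countries.to_csv(index=False)

# "Into" Pandas: read_csv accepts any file-like object, here an in-memory buffer
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

The same `read_csv` call accepts a local file path or a URL in place of the buffer.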

#### 3.1 Python Native Data Structures <a class="anchor" id="section_3_1"></a>

Python has a variety of built-in data structures such as `lists`, `tuples`, `dictionaries`, `strings`, `sets` and `frozensets`. These data structures are ideal for storing data during program execution; however, they are not well suited to analytical tasks such as exploratory analysis and data visualization. Pandas can convert Python data structures into `DataFrame` objects so that users can easily perform data manipulation and analytics.

In [1]:
new_zealand = {'country_name':'New Zealand',
               'capital_city':'Wellington',
               'country_code':'NZ',
               'population':4783063,
               'area_km2':270467}
new_zealand

{'country_name': 'New Zealand',
 'capital_city': 'Wellington',
 'country_code': 'NZ',
 'population': 4783063,
 'area_km2': 270467}

In [2]:
type(new_zealand)

dict

In [4]:
# Create DataFrame from a list of dictionaries
import pandas as pd

list_of_countries = [
{'country_name':'China','capital_city':'Beijing','population':1433783686,'area_km2':9596961},
{'country_name':'New Zealand','capital_city':'Wellington','population':4783063,'area_km2':270467},
{'country_name':'South Africa','capital_city':'Pretoria','population':58558270,'area_km2':1221037},
{'country_name':'United Kingdom','capital_city':'London','population':67530172,'area_km2':242495},
{'country_name':'United States','capital_city':'Washington DC','population':329064917,'area_km2':9525067}]

countries = pd.DataFrame(list_of_countries, index = ['CN','NZ','ZA','GB','US'])


countries #.head()

Unnamed: 0,country_name,capital_city,population,area_km2
CN,China,Beijing,1433783686,9596961
NZ,New Zealand,Wellington,4783063,270467
ZA,South Africa,Pretoria,58558270,1221037
GB,United Kingdom,London,67530172,242495
US,United States,Washington DC,329064917,9525067


In [6]:
countries['capital_city'].tolist()

['Beijing', 'Wellington', 'Pretoria', 'London', 'Washington DC']
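
Besides `tolist()`, a whole DataFrame can be converted back to native Python structures with `to_dict()`. This is a minimal sketch on a two-row sample; the `records` variable is introduced here for illustration only.

```python
import pandas as pd

records = [{'country_name': 'New Zealand', 'capital_city': 'Wellington'},
           {'country_name': 'China', 'capital_city': 'Beijing'}]
df = pd.DataFrame(records)

# orient='records' returns one dictionary per row
as_records = df.to_dict(orient='records')

# orient='list' returns one list of values per column
as_columns = df.to_dict(orient='list')

print(as_records)
print(as_columns)
```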

In [7]:
dictionary_of_countries = {'country_name': ['China', 'New Zealand', 'South Africa', 'United Kingdom', 'United States'],
                           'country_code': ['CN', 'NZ', 'ZA', 'GB', 'US'],
                           'Capital_city': ['Beijing', 'Wellington', 'Pretoria', 'London', 'Washington DC'],
                           'population': [1433783686, 4783063, 58558270, 67530172, 329064917],
                           'area_km2': [9596961, 270467, 1221037, 242495, 9525067]}

In [8]:
countries = pd.DataFrame.from_dict(dictionary_of_countries)
countries.head()

Unnamed: 0,country_name,country_code,Capital_city,population,area_km2
0,China,CN,Beijing,1433783686,9596961
1,New Zealand,NZ,Wellington,4783063,270467
2,South Africa,ZA,Pretoria,58558270,1221037
3,United Kingdom,GB,London,67530172,242495
4,United States,US,Washington DC,329064917,9525067


In [8]:
countries.size

20

In [9]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country_name  5 non-null      object 
 1   country_code  5 non-null      object 
 2   population    5 non-null      int64  
 3   area_km2      4 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes


#### 3.2 Tabular Data Files <a class="anchor" id="section_3_2"></a>

In [11]:
alcohol_data = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv')

alcohol_data.head()


Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
0,Afghanistan,0,0,0,0.0
1,Albania,89,132,54,4.9
2,Algeria,25,0,14,0.7
3,Andorra,245,138,312,12.4
4,Angola,217,57,45,5.9


In [10]:
# Read data from a csv file
iris_data = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')
iris_data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [11]:
# Read data from a csv file
countries_data = pd.read_csv('https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv', index_col = 'CLDR display name')
countries_data.head()

# Dataset reference:
# https://github.com/datasets/country-codes


Unnamed: 0_level_0,FIFA,Dial,ISO3166-1-Alpha-3,MARC,is_independent,ISO3166-1-numeric,GAUL,FIPS,WMO,ISO3166-1-Alpha-2,...,UNTERM Arabic Short,Sub-region Name,official_name_ru,Global Name,Capital,Continent,TLD,Languages,Geoname ID,EDGAR
CLDR display name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Taiwan,TPE,886,TWN,ch,Yes,158.0,925,TW,,TW,...,,,,,Taipei,AS,.tw,"zh-TW,zh,nan,hak",1668284.0,
Afghanistan,AFG,93,AFG,af,Yes,4.0,1,AF,AF,AF,...,أفغانستان,Southern Asia,Афганистан,World,Kabul,AS,.af,"fa-AF,ps,uz-AF,tk",1149361.0,B2
Albania,ALB,355,ALB,aa,Yes,8.0,3,AL,AB,AL,...,ألبانيا,Southern Europe,Албания,World,Tirana,EU,.al,"sq,el",783754.0,B3
Algeria,ALG,213,DZA,ae,Yes,12.0,4,AG,AL,DZ,...,الجزائر,Northern Africa,Алжир,World,Algiers,AF,.dz,ar-DZ,2589581.0,B4
American Samoa,ASA,1-684,ASM,as,Territory of US,16.0,5,AQ,,AS,...,,Polynesia,Американское Самоа,World,Pago Pago,OC,.as,"en-AS,sm,to",5880801.0,B5


#### 3.3 API Query and JSON Format <a class="anchor" id="section_3_3"></a>

In [28]:
# Import requests library to handle API connection
import requests
# Import and initialize Data pretty printer library
import pprint
pp = pprint.PrettyPrinter(indent=4)

In [18]:
# Pass the API query using requests library
response = requests.get("http://api.open-notify.org/astros.json")
# Print the HTTP status code (200 means the request succeeded)
print(response.status_code)

# Parse the JSON response body into Python objects
response_data = response.json()

200


In [19]:
type(response_data)

dict

In [29]:
# Examine the response data
pp.pprint(response_data)

{   'message': 'success',
    'number': 7,
    'people': [   {'craft': 'ISS', 'name': 'Sergey Ryzhikov'},
                  {'craft': 'ISS', 'name': 'Kate Rubins'},
                  {'craft': 'ISS', 'name': 'Sergey Kud-Sverchkov'},
                  {'craft': 'ISS', 'name': 'Mike Hopkins'},
                  {'craft': 'ISS', 'name': 'Victor Glover'},
                  {'craft': 'ISS', 'name': 'Shannon Walker'},
                  {'craft': 'ISS', 'name': 'Soichi Noguchi'}]}


In [31]:
# Create a DataFrame of astronauts currently onboard the ISS
astronauts = pd.DataFrame(response_data['people'])
astronauts

Unnamed: 0,craft,name
0,ISS,Sergey Ryzhikov
1,ISS,Kate Rubins
2,ISS,Sergey Kud-Sverchkov
3,ISS,Mike Hopkins
4,ISS,Victor Glover
5,ISS,Shannon Walker
6,ISS,Soichi Noguchi
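
When top-level fields of the JSON payload are also needed, `pd.json_normalize` may be a convenient alternative. The sketch below uses a static dictionary shaped like the response above (an assumption, so that no network call is required) and keeps the `message` field next to each person.

```python
import pandas as pd

# Static sample shaped like the API response shown above (no network call needed)
sample = {'message': 'success',
          'number': 2,
          'people': [{'craft': 'ISS', 'name': 'Kate Rubins'},
                     {'craft': 'ISS', 'name': 'Victor Glover'}]}

# record_path selects the nested list of records;
# meta copies the listed top-level fields onto every row
astronauts_sample = pd.json_normalize(sample, record_path='people', meta=['message'])
print(astronauts_sample)
```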


In [17]:
# For more information about APIs, see this tutorial
# https://www.dataquest.io/blog/python-api-tutorial/

#### Query Data From SQL Table

In [12]:
# Import SQLite library
import sqlite3

# Assign the database name
db_path = r'local_db_example.db'

# Connect to the database file (SQLite creates it if it does not exist)
conn = sqlite3.connect(db_path)

# Create a cursor for executing SQL statements
c = conn.cursor()

In [13]:
# Create a database table
c.execute("""CREATE TABLE mytable
         (id, name, position)""")

<sqlite3.Cursor at 0x7f15881ca960>

In [14]:
# Add some data
c.execute("""INSERT INTO mytable (id, name, position)
          values(1, 'James', 'Data Scientist')""")

c.execute("""INSERT INTO mytable (id, name, position)
          values(2, 'Mary', 'Software Developer')""")

c.execute("""INSERT INTO mytable (id, name, position)
          values(3, 'Max', 'Data Engineer')""")

<sqlite3.Cursor at 0x7f15881ca960>

In [15]:
# Commit changes, then close the cursor and the connection
conn.commit()
c.close()
conn.close()

In [16]:
# Query the data into a Pandas DataFrame object

# Identify the database name
database = "local_db_example.db"

# Establish a connection with the database file
conn = sqlite3.connect(database)

# Use Pandas function to pass SQL query and create a DataFrame object
people = pd.read_sql("select * from mytable", con=conn)

# Print the generated DataFrame
print(people)

# Close the connection
conn.close()

   id   name            position
0   1  James      Data Scientist
1   2   Mary  Software Developer
2   3    Max       Data Engineer
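
Queries often need a filter value supplied at run time. The following sketch, assuming an in-memory SQLite database rather than the `local_db_example.db` file, passes the value through the `params` argument instead of building the SQL string by hand.

```python
import sqlite3
import pandas as pd

# In-memory database, used here so the sketch leaves no file behind
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE mytable (id INTEGER, name TEXT, position TEXT)')
rows = [(1, 'James', 'Data Scientist'),
        (2, 'Mary', 'Software Developer'),
        (3, 'Max', 'Data Engineer')]
conn.executemany('INSERT INTO mytable VALUES (?, ?, ?)', rows)
conn.commit()

# "?" is sqlite3's placeholder style; params supplies the value safely
query = 'SELECT * FROM mytable WHERE position LIKE ? ORDER BY id'
data_people = pd.read_sql_query(query, conn, params=('Data%',))
conn.close()

print(data_people)
```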


#### 3.4 Web Pages Data <a class="anchor" id="section_3_4"></a>

In [32]:
web_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)')
type(web_data)

<class 'list'>

In [24]:
len(web_data)

5

In [33]:
web_countries_table = web_data[3]

In [35]:
web_countries_table.head()

Unnamed: 0,Country/Territory,UN continentalregion[4],UN statisticalsubregion[4],Population(1 July 2018),Population(1 July 2019),Change
0,China[a],Asia,Eastern Asia,1427647786,1433783686,+0.43%
1,India,Asia,Southern Asia,1352642280,1366417754,+1.02%
2,United States,Americas,Northern America,327096265,329064917,+0.60%
3,Indonesia,Asia,South-eastern Asia,267670543,270625568,+1.10%
4,Pakistan,Asia,Southern Asia,212228286,216565318,+2.04%


In [36]:
print(web_countries_table)

                     Country/Territory UN continentalregion[4]  \
0                             China[a]                    Asia   
1                                India                    Asia   
2                        United States                Americas   
3                            Indonesia                    Asia   
4                             Pakistan                    Asia   
..                                 ...                     ...   
229  Falkland Islands (United Kingdom)                Americas   
230                 Niue (New Zealand)                 Oceania   
231              Tokelau (New Zealand)                 Oceania   
232                    Vatican City[z]                  Europe   
233                              World                     NaN   

    UN statisticalsubregion[4]  Population(1 July 2018)  \
0                 Eastern Asia               1427647786   
1                Southern Asia               1352642280   
2             Northern America

### 4. Describe Information in DataFrames <a class="anchor" id="section_4"></a>

In [27]:
web_countries_table.size

1404

In [28]:
countries_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 250 entries, Taiwan to Åland Islands
Data columns (total 55 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   FIFA                                     239 non-null    object 
 1   Dial                                     249 non-null    object 
 2   ISO3166-1-Alpha-3                        249 non-null    object 
 3   MARC                                     249 non-null    object 
 4   is_independent                           249 non-null    object 
 5   ISO3166-1-numeric                        249 non-null    float64
 6   GAUL                                     243 non-null    object 
 7   FIPS                                     249 non-null    object 
 8   WMO                                      246 non-null    object 
 9   ISO3166-1-Alpha-2                        248 non-null    object 
 10  ITU                                     

In [29]:
population_data = pd.read_csv('https://raw.githubusercontent.com/datasets/population/master/data/population.csv')

In [30]:
population_data.shape

(15409, 4)

In [31]:
population_data.size

61636

In [32]:
population_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15409 entries, 0 to 15408
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Country Name  15409 non-null  object
 1   Country Code  15409 non-null  object
 2   Year          15409 non-null  int64 
 3   Value         15409 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 481.7+ KB


In [2]:
alcohol_data = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv')

In [34]:
alcohol_data.shape

(193, 5)

In [35]:
alcohol_data.size

965

In [37]:
alcohol_data['country'].size

193

In [40]:
len(alcohol_data['country'])

193
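
The numbers above are related: for a DataFrame, `size` is the number of rows multiplied by the number of columns (193 × 5 = 965 for `alcohol_data`), while for a single column `size` and `len()` both give the number of rows. A minimal sketch on a small made-up frame:

```python
import pandas as pd

# Tiny illustrative frame: 3 rows, 2 columns
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = df.shape
print(df.shape)        # (3, 2)
print(df.size)         # 6, i.e. n_rows * n_cols
print(len(df))         # 3, the number of rows
print(df['a'].size)    # 3, a Series has one value per row
```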

In [3]:
alcohol_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
dtypes: float64(1), int64(3), object(1)
memory usage: 7.7+ KB


In [4]:
alcohol_data.describe()

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0
mean,106.160622,80.994819,49.450777,4.717098
std,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0
25%,20.0,4.0,1.0,1.3
50%,76.0,56.0,8.0,4.2
75%,188.0,128.0,59.0,7.2
max,376.0,438.0,370.0,14.4


In [6]:
countries_data = pd.read_csv('https://raw.githubusercontent.com/datasets/country-codes/master/data/country-codes.csv')
countries_data.head()

Unnamed: 0,FIFA,Dial,ISO3166-1-Alpha-3,MARC,is_independent,ISO3166-1-numeric,GAUL,FIPS,WMO,ISO3166-1-Alpha-2,...,Sub-region Name,official_name_ru,Global Name,Capital,Continent,TLD,Languages,Geoname ID,CLDR display name,EDGAR
0,TPE,886,TWN,ch,Yes,158.0,925,TW,,TW,...,,,,Taipei,AS,.tw,"zh-TW,zh,nan,hak",1668284.0,Taiwan,
1,AFG,93,AFG,af,Yes,4.0,1,AF,AF,AF,...,Southern Asia,Афганистан,World,Kabul,AS,.af,"fa-AF,ps,uz-AF,tk",1149361.0,Afghanistan,B2
2,ALB,355,ALB,aa,Yes,8.0,3,AL,AB,AL,...,Southern Europe,Албания,World,Tirana,EU,.al,"sq,el",783754.0,Albania,B3
3,ALG,213,DZA,ae,Yes,12.0,4,AG,AL,DZ,...,Northern Africa,Алжир,World,Algiers,AF,.dz,ar-DZ,2589581.0,Algeria,B4
4,ASA,1-684,ASM,as,Territory of US,16.0,5,AQ,,AS,...,Polynesia,Американское Самоа,World,Pago Pago,OC,.as,"en-AS,sm,to",5880801.0,American Samoa,B5


In [7]:
countries_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 56 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   FIFA                                     239 non-null    object 
 1   Dial                                     249 non-null    object 
 2   ISO3166-1-Alpha-3                        249 non-null    object 
 3   MARC                                     249 non-null    object 
 4   is_independent                           249 non-null    object 
 5   ISO3166-1-numeric                        249 non-null    float64
 6   GAUL                                     243 non-null    object 
 7   FIPS                                     249 non-null    object 
 8   WMO                                      246 non-null    object 
 9   ISO3166-1-Alpha-2                        248 non-null    object 
 10  ITU                                      247 non-n

In [8]:
countries_data['Region Name'].value_counts()

Africa      60
Americas    57
Europe      52
Asia        50
Oceania     29
Name: Region Name, dtype: int64

In [9]:
countries_data['Region Name'].unique()

array([nan, 'Asia', 'Europe', 'Africa', 'Oceania', 'Americas'],
      dtype=object)

In [42]:
countries_data['Region Name'].isnull().sum()

2
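
Note that `value_counts()` skips missing values by default, which is why the five regions above sum to 248 rather than 250. The sketch below, using a short made-up Series, shows how `dropna=False` includes the missing values in the count.

```python
import numpy as np
import pandas as pd

# Short illustrative Series with two missing values
region = pd.Series(['Asia', 'Europe', np.nan, 'Asia', np.nan], name='Region Name')

counts = region.value_counts()                       # NaN excluded by default
counts_with_nan = region.value_counts(dropna=False)  # NaN counted as its own group
n_missing = region.isnull().sum()

print(counts_with_nan)
print(n_missing)
```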

### 6. Data Cleaning in Pandas <a class="anchor" id="section_6"></a>

In [26]:
import pandas as pd

In [27]:
list_of_countries = [
{'Country Name':'China','ISO Code':'CN','Country Population':1433783686,'Country Area km2 (mi2)':'9,596,961 (3,705,407)','Independence Day':'1 October 1949'},
{'Country Name':'New Zealand','ISO Code':'NZ','Country Population':4783063,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'},
{'Country Name':'South Africa','ISO Code':'ZA','Country Population':58558270,'Country Area km2 (mi2)':'1,221,037 (471,445)','Independence Day':'31 May 1910'},
{'Country Name':'Australia','ISO Code':'AU','Country Population':25763300,'Country Area km2 (mi2)':'7,692,024 (2,969,907)', 'Independence Day':'1 January 1901'},
{'Country Name':'United States','ISO Code':'US','Country Population':329064917,'Country Area km2 (mi2)':'9,525,067 (3,677,649)','Independence Day':'4 July 1776'},
{'Country Name':'New Zealand','ISO Code':'NZ','Country Population':4783063,'Country Area km2 (mi2)':'270,467 (104,428)','Independence Day':'26 September 1907'}]

In [28]:
countries = pd.DataFrame(list_of_countries)
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949
1,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907
2,South Africa,ZA,58558270,"1,221,037 (471,445)",31 May 1910
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776
5,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907


In [29]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country Name            6 non-null      object
 1   ISO Code                6 non-null      object
 2   Country Population      6 non-null      int64 
 3   Country Area km2 (mi2)  6 non-null      object
 4   Independence Day        6 non-null      object
dtypes: int64(1), object(4)
memory usage: 368.0+ bytes


#### Split DataFrame Columns

In [30]:
countries['Country Area km2 (mi2)'].str.split(' ', expand = True)

Unnamed: 0,0,1
0,9596961,"(3,705,407)"
1,270467,"(104,428)"
2,1221037,"(471,445)"
3,7692024,"(2,969,907)"
4,9525067,"(3,677,649)"
5,270467,"(104,428)"


In [31]:
countries[['Area km2','Area mi2']] = countries['Country Area km2 (mi2)'].str.split(' ', expand = True)
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,9596961,"(3,705,407)"
1,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,270467,"(104,428)"
2,South Africa,ZA,58558270,"1,221,037 (471,445)",31 May 1910,1221037,"(471,445)"
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,7692024,"(2,969,907)"
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,9525067,"(3,677,649)"
5,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,270467,"(104,428)"


In [32]:
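# Remove every non-digit character; regex=True is explicit because the default changed to False in pandas 2.0
countries['Area km2'] = countries['Area km2'].str.replace(r'\D+', '', regex=True)
countries['Area mi2'] = countries['Area mi2'].str.replace(r'\D+', '', regex=True)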
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1 October 1949,9596961,3705407
1,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,270467,104428
2,South Africa,ZA,58558270,"1,221,037 (471,445)",31 May 1910,1221037,471445
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1 January 1901,7692024,2969907
4,United States,US,329064917,"9,525,067 (3,677,649)",4 July 1776,9525067,3677649
5,New Zealand,NZ,4783063,"270,467 (104,428)",26 September 1907,270467,104428
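
The split-then-replace steps above could also be combined into a single pattern match. This is only a sketch of one alternative, assuming every value follows the `km2 (mi2)` layout shown in the table; the `parts` variable is introduced here for illustration.

```python
import pandas as pd

area = pd.Series(['9,596,961 (3,705,407)', '270,467 (104,428)'])

# Two capture groups: the number before the space and the number inside the parentheses
parts = area.str.extract(r'([\d,]+) \(([\d,]+)\)')
parts.columns = ['area_km2', 'area_mi2']

# Remove the thousands separators, then convert each column to integers
parts = parts.apply(lambda col: col.str.replace(',', '', regex=False).astype('int64'))
print(parts)
```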


In [33]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country Name            6 non-null      object
 1   ISO Code                6 non-null      object
 2   Country Population      6 non-null      int64 
 3   Country Area km2 (mi2)  6 non-null      object
 4   Independence Day        6 non-null      object
 5   Area km2                6 non-null      object
 6   Area mi2                6 non-null      object
dtypes: int64(1), object(6)
memory usage: 464.0+ bytes


In [34]:
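# Use an explicit unit ('datetime64[ns]'); newer pandas versions may reject the unit-less 'datetime64'
countries = countries.astype({'Area km2': 'int64', 'Area mi2':'int64', 'Independence Day':'datetime64[ns]'})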
countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Country Name            6 non-null      object        
 1   ISO Code                6 non-null      object        
 2   Country Population      6 non-null      int64         
 3   Country Area km2 (mi2)  6 non-null      object        
 4   Independence Day        6 non-null      datetime64[ns]
 5   Area km2                6 non-null      int64         
 6   Area mi2                6 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 464.0+ bytes


In [35]:
#countries = countries.drop_duplicates()
countries

Unnamed: 0,Country Name,ISO Code,Country Population,Country Area km2 (mi2),Independence Day,Area km2,Area mi2
0,China,CN,1433783686,"9,596,961 (3,705,407)",1949-10-01,9596961,3705407
1,New Zealand,NZ,4783063,"270,467 (104,428)",1907-09-26,270467,104428
2,South Africa,ZA,58558270,"1,221,037 (471,445)",1910-05-31,1221037,471445
3,Australia,AU,25763300,"7,692,024 (2,969,907)",1901-01-01,7692024,2969907
4,United States,US,329064917,"9,525,067 (3,677,649)",1776-07-04,9525067,3677649
5,New Zealand,NZ,4783063,"270,467 (104,428)",1907-09-26,270467,104428


In [36]:
countries.drop('Country Area km2 (mi2)', axis = 1, inplace = True)

In [37]:
countries.drop(5, axis = 0, inplace = True)
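
The repeated New Zealand row was removed above by its index label. As a sketch of the approach hinted at by the commented-out `drop_duplicates()` line, the example below finds and removes repeated rows without knowing their labels in advance; the small frame is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'country_name': ['China', 'New Zealand', 'New Zealand'],
                   'country_code': ['CN', 'NZ', 'NZ']})

# duplicated() flags rows that repeat an earlier row
print(df.duplicated().tolist())   # [False, False, True]

# drop_duplicates() keeps the first occurrence; reset_index renumbers the rows
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```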

#### Rename Columns

In [41]:
countries.rename(columns={'Country Name': 'country_name', 'ISO Code': 'country_code',
                          'Country Population': 'country_population', 'Independence Day': 'independence_date',
                          'Area km2': 'area_km2', 'Area mi2': 'area_mi2'}, inplace=True)


In [42]:
countries.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   country_name        5 non-null      object        
 1   country_code        5 non-null      object        
 2   country_population  5 non-null      int64         
 3   independence_date   5 non-null      datetime64[ns]
 4   area_km2            5 non-null      int64         
 5   area_mi2            5 non-null      int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 280.0+ bytes


In [43]:
countries

Unnamed: 0,country_name,country_code,country_population,independence_date,area_km2,area_mi2
0,China,CN,1433783686,1949-10-01,9596961,3705407
1,New Zealand,NZ,4783063,1907-09-26,270467,104428
2,South Africa,ZA,58558270,1910-05-31,1221037,471445
3,Australia,AU,25763300,1901-01-01,7692024,2969907
4,United States,US,329064917,1776-07-04,9525067,3677649
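
When many columns follow the same naming pattern, `rename` also accepts a function instead of a dictionary. The sketch below converts names to snake_case; note that it cannot reproduce semantic renames such as `ISO Code` to `country_code`, which still need an explicit mapping as shown above.

```python
import pandas as pd

# Empty frame with a few of the original column names, for illustration
df = pd.DataFrame(columns=['Country Name', 'ISO Code', 'Area km2'])

# The function is applied to every column name
df = df.rename(columns=lambda name: name.lower().replace(' ', '_'))
print(list(df.columns))   # ['country_name', 'iso_code', 'area_km2']
```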
