## Types of Data

Each column (feature) in a pandas DataFrame has a data type associated with it. These data types can be grouped into **Numerical**, **Categorical**, and **Dates**.

In [2]:
#| echo: false

# import image module
from IPython.display import Image

# get the image
Image(url="./images/datatypes.gif",width=500)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
movie_ratings = pd.read_csv('./datasets/movie_ratings.csv')
movie_ratings.head()

Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes
0,Opal Dreams,14443,14443,9000000,Nov 22 2006,PG/PG-13,Adapted screenplay,Drama,Fiction,6.5,468
1,Major Dundee,14873,14873,3800000,Apr 07 1965,PG/PG-13,Adapted screenplay,Western/Musical,Fiction,6.7,2588
2,The Informers,315000,315000,18000000,Apr 24 2009,R,Adapted screenplay,Horror/Thriller,Fiction,5.2,7595
3,Buffalo Soldiers,353743,353743,15000000,Jul 25 2003,R,Adapted screenplay,Comedy,Fiction,6.9,13510
4,The Last Sin Eater,388390,388390,2200000,Feb 09 2007,PG/PG-13,Adapted screenplay,Drama,Fiction,5.7,1012


The `dtypes` property is used to find the data types associated with each column in the DataFrame.

In [4]:
movie_ratings.dtypes

Title                 object
US Gross               int64
Worldwide Gross        int64
Production Budget      int64
Release Date          object
MPAA Rating           object
Source                object
Major Genre           object
Creative Type         object
IMDB Rating          float64
IMDB Votes             int64
dtype: object

### Available Data Types and Associated Built-in Functions

In a DataFrame, columns can have different data types. Here are the common data types you'll encounter and some built-in functions associated with each type:

1. **Numerical Data (int, float)**
   - Built-in functions: `mean()`, `sum()`, `min()`, `max()`, `std()`, `median()`, `quantile()`, etc.

2. **Object Data (str or mixed types)**
   - Built-in functions: `str.contains()`, `str.startswith()`, `str.endswith()`, `str.lower()`, `str.upper()`, `str.replace()`, etc.

3. **Datetime Data (dates)**
   - Built-in attributes and methods: `dt.year`, `dt.month`, `dt.day`, `dt.strftime()`, `dt.weekday`, `dt.hour`, etc.

These functions help in exploring and transforming the data effectively depending on the type of data in each column.
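As a quick illustration of how the available methods depend on the data type, here is a minimal sketch on a small made-up DataFrame. The `toy` frame, its column names, and its values are hypothetical and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate dtype-specific methods
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "rating": [6.5, 7.0, 5.5],
    "released": pd.to_datetime(["2006-11-22", "1965-04-07", "2009-04-24"]),
})

# Numerical column: aggregation methods are called directly
mean_rating = toy["rating"].mean()

# String column: string methods live under the .str accessor
upper_titles = toy["title"].str.upper()

# Datetime column: date components live under the .dt accessor
release_years = toy["released"].dt.year

print(mean_rating)
print(upper_titles.tolist())
print(release_years.tolist())
```

Calling `.str` on a numeric column or `.dt` on a non-datetime column raises an error, which is why checking `dtypes` first is a useful habit.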

### Data Type Filtering
We can select columns based on their data types.

In [None]:

# select just categorical(object) columns
movie_ratings.select_dtypes(include='object').head()

Unnamed: 0,Title,Release Date,MPAA Rating,Source,Major Genre,Creative Type
0,Opal Dreams,Nov 22 2006,PG/PG-13,Adapted screenplay,Drama,Fiction
1,Major Dundee,Apr 07 1965,PG/PG-13,Adapted screenplay,Western/Musical,Fiction
2,The Informers,Apr 24 2009,R,Adapted screenplay,Horror/Thriller,Fiction
3,Buffalo Soldiers,Jul 25 2003,R,Adapted screenplay,Comedy,Fiction
4,The Last Sin Eater,Feb 09 2007,PG/PG-13,Adapted screenplay,Drama,Fiction


In [8]:
# select the numeric columns
movie_ratings.select_dtypes(include='number').head()

Unnamed: 0,US Gross,Worldwide Gross,Production Budget,IMDB Rating,IMDB Votes
0,14443,14443,9000000,6.5,468
1,14873,14873,3800000,6.7,2588
2,315000,315000,18000000,5.2,7595
3,353743,353743,15000000,6.9,13510
4,388390,388390,2200000,5.7,1012
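`select_dtypes()` also accepts a list of types and an `exclude` argument. The sketch below uses a hypothetical `toy` DataFrame (not the movie data) to show both options.

```python
import pandas as pd

# Hypothetical toy data with string, numeric, and datetime columns
toy = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "budget": [9_000_000, 3_800_000],
    "rating": [6.5, 6.7],
    "released": pd.to_datetime(["2006-11-22", "1965-04-07"]),
})

# Keep numeric and datetime columns together
num_and_dates = toy.select_dtypes(include=["number", "datetime"])

# Keep everything except the numeric columns
non_numeric = toy.select_dtypes(exclude="number")

print(list(num_and_dates.columns))
print(list(non_numeric.columns))
```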


###  Data Type Conversion

Often, after the initial read, we need to convert the data types of some columns to make them suitable for analysis, because the available functions and operations depend on a column's data type. For example, the data type of `Release Date` in the DataFrame `movie_ratings` is `object`. To perform datetime-related computations on this variable, we need to convert it to a datetime format with the pandas function `to_datetime()`. Similar functions, such as `to_numeric()` and `to_timedelta()`, and the `astype()` method can be used for other conversions.

In pandas, the `errors='coerce'` argument is often used in data conversion, for example with `pd.to_numeric()` or `pd.to_datetime()`. It tells pandas to convert the values it can and to set the ones it cannot convert to `NaN` (`NaT` for datetimes). This handles bad values gracefully without raising an exception. Read the textbook for an example.
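A minimal sketch of `errors='coerce'`, assuming a couple of hypothetical messy Series (the values below are made up for illustration):

```python
import pandas as pd

# Hypothetical raw values, as they might appear after reading a messy file
raw = pd.Series(["10", "3.5", "unknown", "42"])

# Values that cannot be parsed become NaN instead of raising an error
numbers = pd.to_numeric(raw, errors="coerce")

# The same argument works for dates; unparseable entries become NaT
dates = pd.to_datetime(pd.Series(["2006-11-22", "not a date"]), errors="coerce")

print(numbers)
print(numbers.isna().sum())  # number of values that failed to convert
print(dates)
```

Counting the missing values after the conversion is a simple way to see how many entries could not be parsed.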

In [7]:
# check the datatype of the release date column
print(movie_ratings['Release Date'].dtypes)
movie_ratings['Release Date'].head()

object


0    Nov 22 2006
1    Apr 07 1965
2    Apr 24 2009
3    Jul 25 2003
4    Feb 09 2007
Name: Release Date, dtype: object

Next, we’ll convert the `Release Date` column in the DataFrame to the `datetime` format to facilitate further analysis.

In [8]:
movie_ratings['Release Date'] = pd.to_datetime(movie_ratings['Release Date'])

# Let's check the datatype of the release date column again
movie_ratings['Release Date'].dtypes

dtype('<M8[ns]')

`dtype('<M8[ns]')` means a 64-bit datetime object with nanosecond precision stored in little-endian format. This data type is commonly used to represent timestamps in high-resolution time series data.
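To connect the short code to its longer name, the sketch below compares NumPy's `M8[ns]` code with `datetime64[ns]`, and shows a check that does not depend on the exact precision. The sample dates are hypothetical.

```python
import numpy as np
import pandas as pd

# 'M8[ns]' is NumPy's short code for datetime64 with nanosecond precision
short_code = np.dtype("M8[ns]")
long_name = np.dtype("datetime64[ns]")
print(short_code == long_name)

# A precision-agnostic way to check that a column holds datetimes
dates = pd.to_datetime(pd.Series(["2006-11-22", "1965-04-07"]))
print(pd.api.types.is_datetime64_any_dtype(dates))
```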

### Working with `datetime` Data

#### the `dt` accessor

The `.dt` accessor is a powerful tool in pandas that allows you to extract and manipulate components of datetime columns in a DataFrame. This is useful for analysis and feature engineering when dealing with time-related data. 

You can extract various parts of a datetime column using `.dt`:

| Attribute         | Description                                   | Example                                 |
|-------------------|-----------------------------------------------|-----------------------------------------|
| `.dt.year`        | Extracts the year                            | `df['Release Date'].dt.year`           |
| `.dt.month`       | Extracts the month (1-12)                    | `df['Release Date'].dt.month`          |
| `.dt.day`         | Extracts the day of the month (1-31)         | `df['Release Date'].dt.day`            |
| `.dt.hour`        | Extracts the hour (0-23)                     | `df['Release Date'].dt.hour`           |
| `.dt.minute`      | Extracts the minute                          | `df['Release Date'].dt.minute`         |
| `.dt.second`      | Extracts the second                          | `df['Release Date'].dt.second`         |
| `.dt.weekday`     | Extracts the day of the week (0=Monday, 6=Sunday) | `df['Release Date'].dt.weekday`    |
| `.dt.dayofyear`   | Extracts the day of the year (1-366)         | `df['Release Date'].dt.dayofyear`      |
| `.dt.is_leap_year`| Checks if the year is a leap year            | `df['Release Date'].dt.is_leap_year`   |
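The attributes in the table can be tried out on a small Series. This is a sketch with three hypothetical dates; `dt.day_name()` and `dt.strftime()` are methods rather than attributes, so they need parentheses.

```python
import pandas as pd

# Hypothetical dates, chosen to include one leap-year day
dates = pd.Series(pd.to_datetime(["2006-11-22", "1965-04-07", "2008-02-29"]))

parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "weekday": dates.dt.weekday,           # 0=Monday, 6=Sunday
    "day_name": dates.dt.day_name(),       # method: note the parentheses
    "leap": dates.dt.is_leap_year,
    "label": dates.dt.strftime("%Y-%m"),   # formatted string such as '2006-11'
})
print(parts)
```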

Next, let's add the release year as a new column in the `movie_ratings` DataFrame.


In [18]:
# Extracting the year from the release date
movie_ratings['Release Year'] = movie_ratings['Release Date'].dt.year
movie_ratings.head()

Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes,Release Year
0,Opal Dreams,14443,14443,9000000,2006-11-22,PG/PG-13,Adapted screenplay,Drama,Fiction,6.5,468,2006
1,Major Dundee,14873,14873,3800000,1965-04-07,PG/PG-13,Adapted screenplay,Western/Musical,Fiction,6.7,2588,1965
2,The Informers,315000,315000,18000000,2009-04-24,R,Adapted screenplay,Horror/Thriller,Fiction,5.2,7595,2009
3,Buffalo Soldiers,353743,353743,15000000,2003-07-25,R,Adapted screenplay,Comedy,Fiction,6.9,13510,2003
4,The Last Sin Eater,388390,388390,2200000,2007-02-09,PG/PG-13,Adapted screenplay,Drama,Fiction,5.7,1012,2007


If your DataFrame has a start date and an end date, you can calculate the duration between them as shown below.

In [9]:
# let's calculate the days since release till Jan 1st 2024
movie_ratings['Days Since Release'] = (pd.Timestamp('2024-01-01') - movie_ratings['Release Date']).dt.days
movie_ratings.head()


Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes,Days Since Release
0,Opal Dreams,14443,14443,9000000,2006-11-22,PG/PG-13,Adapted screenplay,Drama,Fiction,6.5,468,6249
1,Major Dundee,14873,14873,3800000,1965-04-07,PG/PG-13,Adapted screenplay,Western/Musical,Fiction,6.7,2588,21453
2,The Informers,315000,315000,18000000,2009-04-24,R,Adapted screenplay,Horror/Thriller,Fiction,5.2,7595,5365
3,Buffalo Soldiers,353743,353743,15000000,2003-07-25,R,Adapted screenplay,Comedy,Fiction,6.9,13510,7465
4,The Last Sin Eater,388390,388390,2200000,2007-02-09,PG/PG-13,Adapted screenplay,Drama,Fiction,5.7,1012,6170


**Filtering Data**: Use extracted datetime components to filter rows. Aggregation and grouping using datetime components will be covered in future chapters.

In [21]:
# Filter rows where the release month is January
january_releases = movie_ratings[movie_ratings['Release Date'].dt.month == 1]
january_releases.head()


Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes,Release Year,Days Since Release
15,Thr3e,1008849,1060418,2400000,2007-01-05,PG/PG-13,Adapted screenplay,Horror/Thriller,Fiction,5.0,2825,2007,6205
57,Impostor,6114237,6114237,40000000,2002-01-04,PG/PG-13,Adapted screenplay,Horror/Thriller,Fiction,6.0,9020,2002,8032
62,The Last Station,6616974,6616974,18000000,2010-01-15,R,Adapted screenplay,Drama,Non-Fiction,7.0,3465,2010,5099
63,The Big Bounce,6471394,6626115,50000000,2004-01-30,PG/PG-13,Adapted screenplay,Comedy,Fiction,4.8,9195,2004,7276
84,Not Easily Broken,10572742,10572742,5000000,2009-01-09,PG/PG-13,Adapted screenplay,Drama,Fiction,5.2,1010,2009,5470


### Working with `object` Data

In pandas, the `object` data type is a flexible type that can store text (strings), values of mixed types, or arbitrary Python objects. It is most commonly used for string data and is central to working with categorical or unstructured data.

Just as datetime columns have a `dt` accessor, string columns have a `str` accessor that exposes a number of specialized string methods. These methods generally share names with the equivalent built-in Python string methods, but are applied element-wise to each value in the column.

#### the `str` accessor in pandas

The `str` accessor in pandas provides a wide range of string methods that allow for efficient and convenient text processing on an entire Series of strings. Here are some commonly used str methods:

* String splitting: `str.split()`
* String joining: `str.join()`
* Substrings: `str.slice(start, stop)`, `str[0]`
* String Case Conversion: `str.lower()`, `str.upper()`, `str.capitalize()`
* Whitespace Removal: `str.strip()`, `str.lstrip()`, `str.rstrip()`
* Replacing and Removing: `str.replace('old', 'new')`
* Pattern matching and extraction: `str.contains('pattern')`, `str.startswith('prefix')`, `str.endswith('suffix')`
* String length and counting: `str.len()`, `str.count()`
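Several of the methods listed above can be chained on a small Series. The sketch below uses two hypothetical name strings in a "Last, Title. First" format.

```python
import pandas as pd

# Hypothetical names; the first one has stray whitespace on both sides
names = pd.Series(["  Braund, Mr. Owen Harris ", "Heikkinen, Miss. Laina"])

cleaned = names.str.strip()                             # remove surrounding whitespace
last_names = cleaned.str.split(",").str[0]              # first piece of each split
is_miss = cleaned.str.contains("Miss.", regex=False)    # literal match, no regex
lengths = cleaned.str.len()                             # characters per entry

print(last_names.tolist())
print(is_miss.tolist())
print(lengths.tolist())
```

Passing `regex=False` to `str.contains()` matters here because, by default, the pattern is treated as a regular expression in which `.` matches any character.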

Next, let's use the well-known Titanic dataset to illustrate how to manipulate string columns in a pandas DataFrame.

In [25]:
titanic = pd.read_csv('./Datasets/titanic.csv')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S


In [26]:
titanic.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

The `Name` column varies in length and contains passengers' last names, titles, and first names. By extracting the title from the `Name` column, we could infer the **sex** of the passenger and add it as a new feature. This could potentially be a significant predictor of survival, as the "ladies first" principle was often applied during the Titanic evacuation. Adding this feature may enhance the model's ability to predict whether a passenger survived.

In [27]:
# Let's check the length of the name of each passenger
titanic["Name"].str.len()

0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64

In [28]:
# find the longest name in the dataset
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'

In [29]:
# get the observations whose name contains the literal text 'Mrs.'
# (regex=False: otherwise '.' would be treated as a regex wildcard)
titanic[titanic.Name.str.contains('Mrs.', regex=False)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,1,0,237736,30.0708,,C
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",55.0,0,0,248706,16.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",47.0,1,1,11751,52.5542,D35,S
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",28.0,1,0,P/PP 3381,24.0000,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",25.0,0,1,230433,26.0000,,S


In [30]:
# count the occurrences of 'Mrs.' across all names
# (str.count treats the pattern as a regex, so the dot is escaped)
titanic.Name.str.count(r'Mrs\.').sum()

129

In [32]:
# split the name column into two columns
titanic.Name.str.split(',', expand=True)

Unnamed: 0,0,1
0,Braund,Mr. Owen Harris
1,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,Heikkinen,Miss. Laina
3,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,Allen,Mr. William Henry
...,...,...
886,Montvila,Rev. Juozas
887,Graham,Miss. Margaret Edith
888,Johnston,"Miss. Catherine Helen ""Carrie"""
889,Behr,Mr. Karl Howell


In [33]:
# get the second part of the split (everything after the comma)
titanic.Name.str.split(',', expand=True).get(1)

0                                  Mr. Owen Harris
1       Mrs. John Bradley (Florence Briggs Thayer)
2                                      Miss. Laina
3               Mrs. Jacques Heath (Lily May Peel)
4                                Mr. William Henry
                          ...                     
886                                    Rev. Juozas
887                           Miss. Margaret Edith
888                 Miss. Catherine Helen "Carrie"
889                                Mr. Karl Howell
890                                    Mr. Patrick
Name: 1, Length: 891, dtype: object

Next, create a new column `Title` that contains the title of each passenger.

In [34]:
# from the name, extract the title
titanic['Title'] = titanic.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
titanic['Title']

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
886     Rev
887    Miss
888    Miss
889      Mr
890      Mr
Name: Title, Length: 891, dtype: object
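The same extraction can also be written with the `str` accessor instead of `apply()`. The sketch below is one possible alternative, not the approach used above: it uses `str.extract()` with a regular expression that captures the text between the comma and the first period. The sample names are written in the format of the `Name` column.

```python
import pandas as pd

# Sample strings in the "Last, Title. First" format of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# ,\s*      a comma followed by optional whitespace
# ([^.]+)   capture one or more characters that are not a period
# \.        the period that ends the title
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract()` returns a Series rather than a DataFrame.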

In [35]:
# get the unique titles
titanic.Title.unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)

Next, let's create a dictionary that maps each title to a sex.

In [36]:
title_sex_mapping = {
    'Mr': 'Male',
    'Mrs': 'Female',
    'Miss': 'Female',
    'Master': 'Male',
    'Don': 'Male',
    'Rev': 'Male',
    'Dr': 'Male',  # Assumed to be Male unless you have additional context
    'Mme': 'Female',
    'Ms': 'Female',
    'Major': 'Male',
    'Lady': 'Female',
    'Sir': 'Male',
    'Mlle': 'Female',
    'Col': 'Male',
    'Capt': 'Male',
    'the Countess': 'Female',
    'Jonkheer': 'Male'
}

In [38]:
titanic['Sex'] = titanic['Title'].map(title_sex_mapping)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Sex
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Mr,Male
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Mrs,Female
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,Female
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Mrs,Female
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Mr,Male


In [39]:
titanic.Sex.value_counts()

Sex
Male      578
Female    313
Name: count, dtype: int64
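One thing to keep in mind with `map()` and a dictionary: any value that is missing from the dictionary becomes `NaN`. The sketch below uses a shortened, hypothetical mapping and a deliberately unmapped title to show this.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mrs", "Dona"])   # 'Dona' is deliberately unmapped
mapping = {"Mr": "Male", "Mrs": "Female"}   # shortened, hypothetical mapping

sex = titles.map(mapping)

print(sex.tolist())
print(sex.isna().sum())   # number of titles that had no entry in the mapping
```

Checking `isna().sum()` after a mapping step is a simple way to confirm that every category was covered.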

In [40]:
# cross tabulation
pd.crosstab(titanic.Survived, titanic.Sex)

Sex,Female,Male
Survived,Unnamed: 1_level_1,Unnamed: 2_level_1
0,81,468
1,232,110


In the Titanic disaster, more women survived than men due to the social norms and evacuation protocols followed during the sinking. The principle of "women and children first" was enforced when lifeboats were being loaded. Since there were not enough lifeboats for everyone on board, priority was given to women and children, which contributed to the higher survival rate among females compared to males.

#### the `re` module for regular expressions

The `re` module in Python is used for **regular expressions**, which are powerful tools for text analysis and manipulation. Regular expressions allow you to search, match, and manipulate strings based on specific patterns.

Common use cases of `re` in text analysis:

* Finding patterns in text: `re.search(r'\d{4}-\d{2}-\d{2}', text)` searches for a date in the format YYYY-MM-DD.
* Extracting specific parts of a string: `re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)` extracts email addresses from the text.
* Replacing parts of a string: `re.sub(r'\$\d+', '[price]', text)` replaces any price formatted as `$<number>` with `[price]`.

##### Commonly Used Patterns in `re`

- **`\d`**: Matches any digit (0-9).
- **`\w`**: Matches any word character (letters, digits, and the underscore).
- **`\s`**: Matches any whitespace character (spaces, tabs, newlines).
- **`[a-z]`**: Matches any lowercase letter from `a` to `z`.
- **`[A-Z]`**: Matches any uppercase letter from `A` to `Z`.
- **`*`**: Matches 0 or more occurrences of the preceding element.
- **`+`**: Matches 1 or more occurrences of the preceding element.
- **`?`**: Matches 0 or 1 occurrence of the preceding element.
- **`^`**: Matches the beginning of a string.
- **`$`**: Matches the end of a string.
- **`|`**: Acts as an OR operator.
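The patterns above can be combined in a short, self-contained sketch. The `text` string below is made up for illustration.

```python
import re

# Hypothetical text containing a date, a price, and an email address
text = "Order 66 shipped on 2024-01-15 for $120; contact sales@example.com."

# \d{4}-\d{2}-\d{2}: four digits, a dash, two digits, a dash, two digits
date_match = re.search(r"\d{4}-\d{2}-\d{2}", text)

# \$\d+: a literal dollar sign followed by one or more digits
masked = re.sub(r"\$\d+", "[price]", text)

# Email pattern from the list above
emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text)

# ^ anchors the pattern at the start of the string, | acts as OR
starts_with_order = bool(re.search(r"^(Order|Invoice)\s\d+", text))

print(date_match.group())
print(masked)
print(emails)
print(starts_with_order)
```

Note that `re.search()` returns `None` when nothing matches, so real code should check the result before calling `.group()`.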

In [41]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_Chicago_Bulls_seasons')
ChicagoBulls = data[2]
ChicagoBulls.head()

Unnamed: 0,Season,Team,Conference,Finish,Division,Finish.1,Wins,Losses,Win%,GB,Playoffs,Awards,Head coach
0,1966–67,1966–67,—,—,Western,4th,33,48,0.407,11,Lost Division semifinals (Hawks) 3–0[19],Johnny Kerr (COY)[6],Johnny Kerr
1,1967–68,1967–68,—,—,Western,4th,29,53,0.354,27,Lost Division semifinals (Lakers) 4–1[20],—,Johnny Kerr
2,1968–69,1968–69,—,—,Western,5th,33,49,0.402,22,,—,Dick Motta
3,1969–70,1969–70,—,—,Western,3rd[c],39,43,0.476,9,Lost Division semifinals (Hawks) 4–1[22],—,Dick Motta
4,1970–71,1970–71,Western,3rd,Midwest[d],2nd,51,31,0.622,2,Lost conference semifinals (Lakers) 4–3[23],Dick Motta (COY)[6],Dick Motta


In [42]:
# remove all characters between square brackets, including the brackets themselves, in the columns Division, Finish, and Finish.1
import re
def remove_brackets(x):
    return re.sub(r'\[.*?\]', '', x)

# Apply the function to each column separately using map
ChicagoBulls['Division'] = ChicagoBulls['Division'].map(remove_brackets)
ChicagoBulls['Finish'] = ChicagoBulls['Finish'].map(remove_brackets)
ChicagoBulls['Finish.1'] = ChicagoBulls['Finish.1'].map(remove_brackets)

ChicagoBulls.head()

Unnamed: 0,Season,Team,Conference,Finish,Division,Finish.1,Wins,Losses,Win%,GB,Playoffs,Awards,Head coach
0,1966–67,1966–67,—,—,Western,4th,33,48,0.407,11,Lost Division semifinals (Hawks) 3–0[19],Johnny Kerr (COY)[6],Johnny Kerr
1,1967–68,1967–68,—,—,Western,4th,29,53,0.354,27,Lost Division semifinals (Lakers) 4–1[20],—,Johnny Kerr
2,1968–69,1968–69,—,—,Western,5th,33,49,0.402,22,,—,Dick Motta
3,1969–70,1969–70,—,—,Western,3rd,39,43,0.476,9,Lost Division semifinals (Hawks) 4–1[22],—,Dick Motta
4,1970–71,1970–71,Western,3rd,Midwest,2nd,51,31,0.622,2,Lost conference semifinals (Lakers) 4–3[23],Dick Motta (COY)[6],Dick Motta


As another example, let's clean the column names of a GDP table read from Wikipedia.

In [43]:

gdp_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita")[1]

gdp_data.head()

Unnamed: 0_level_0,Country/Territory,IMF[4][5],IMF[4][5],World Bank[6],World Bank[6],United Nations[7],United Nations[7]
Unnamed: 0_level_1,Country/Territory,Estimate,Year,Estimate,Year,Estimate,Year
0,Monaco,—,—,240862,2022,240535,2022
1,Liechtenstein,—,—,187267,2022,197268,2022
2,Luxembourg,135321,2024,128259,2023,125897,2022
3,Bermuda,—,—,123091,2022,117568,2022
4,Switzerland,106098,2024,99995,2023,93636,2022


In [45]:
# drop every 'Year' column found in level 1 of the column MultiIndex
gdp_data = gdp_data.drop(columns=["Year"], level=1)

# drop level 1 of the column index, keeping only the source names
gdp_data = gdp_data.droplevel(1, axis=1)

In [46]:

# keep only the text before the first '[' (drops footnote markers such as [4][5])
column_name_cleaner = lambda x: re.split(r'\[', x)[0]

gdp_data.columns = gdp_data.columns.map(column_name_cleaner)

gdp_data.head()

Unnamed: 0,Country/Territory,IMF,World Bank,United Nations
0,Monaco,—,240862,240535
1,Liechtenstein,—,187267,197268
2,Luxembourg,135321,128259,125897
3,Bermuda,—,123091,117568
4,Switzerland,106098,99995,93636


#### the `NLTK` Library for NLP (skipped)

NLTK (Natural Language Toolkit) is a popular Python library for natural language processing (NLP) and text analysis. It provides a wide range of tools and resources for processing and analyzing human language data. NLTK is widely used in research, education, and industry for various text analysis tasks.

##### Key Features and Capabilities of NLTK

* **Tokenization**: Splits text into individual words (word tokenization) or sentences (sentence tokenization).
* **Stop Word Removal**: Provides lists of common words (like "and", "the", "is") in various languages that can be removed from text to reduce noise.
* **Stemming**: Reduces words to their root form (e.g., "running" to "run").
* **Lemmatization**: Similar to stemming, but more sophisticated. It reduces words to their dictionary form using vocabulary and morphological analysis (e.g., "better" to "good").

### Working with `numerical` Data 

#### Summary statistics across rows/columns in Pandas

The pandas DataFrame class has methods such as `sum()` and `mean()` to compute statistics over the rows or columns of a DataFrame.

By default, methods like `mean()` and `sum()` compute the statistic for each column (i.e., all rows are aggregated) in the DataFrame.
Let us first look at the summary statistics, including the mean, of all the numeric columns of the data:

In [None]:
movie_ratings.describe()

Unnamed: 0,US Gross,Worldwide Gross,Production Budget,IMDB Rating,IMDB Votes
count,2228.0,2228.0,2228.0,2228.0,2228.0
mean,50763700.0,101937000.0,38160550.0,6.239004,33585.154847
std,66430810.0,164858900.0,37826040.0,1.243285,47325.651561
min,0.0,884.0,218.0,1.4,18.0
25%,9646188.0,13207370.0,12000000.0,5.5,6659.25
50%,28386490.0,42668920.0,26000000.0,6.4,18169.0
75%,64531400.0,120000000.0,53000000.0,7.1,40092.75
max,760167600.0,2767891000.0,300000000.0,9.2,519541.0


In [None]:
# compute the mean of each numeric column
movie_ratings.mean(numeric_only=True)

US Gross                  5.076370e+07
Worldwide Gross           1.019370e+08
Production Budget         3.816055e+07
IMDB Rating               6.239004e+00
IMDB Votes                3.358515e+04
Release Year              2.002005e+03
ratio_wgross_by_budget    1.259483e+01
dtype: float64

**Using the `axis` parameter**:

The `axis` parameter controls whether to compute the statistic across rows or columns:
* The argument `axis=0` (default) computes the statistic over all the rows, giving one value per column.
* The argument `axis=1` computes the statistic over all the columns, giving one value per row.

If the mean over a subset of columns is desired, those columns can be selected from the data first.

For example, let us compute the mean IMDB rating, and mean IMDB votes of all the movies:


In [None]:
movie_ratings[['IMDB Rating','IMDB Votes']].mean(axis = 0)

IMDB Rating        6.239004
IMDB Votes     33585.154847
dtype: float64

**Pandas `sum` function**

In [None]:
data = [[10, 18, 11], [13, 15, 8], [9, 20, 3]]
df = pd.DataFrame(data)
df

Unnamed: 0,0,1,2
0,10,18,11
1,13,15,8
2,9,20,3


In [None]:
# By default, the sum() method adds values across rows and returns the sum for each column
df.sum()

0    32
1    53
2    22
dtype: int64

In [None]:
# By specifying the column axis (axis='columns'), the sum() method adds values across columns and returns the sum of each row.
df.sum(axis = 'columns')

0    39
1    36
2    32
dtype: int64

In [None]:
# in pandas, axis=1 is equivalent to axis='columns', while axis=0 is equivalent to axis='index'
df.sum(axis = 1)

0    39
1    36
2    32
dtype: int64
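The `axis` argument behaves the same way for other statistics. As a brief sketch, the example below rebuilds the 3x3 values used in the `sum` example and computes column-wise and row-wise means with the named axes.

```python
import pandas as pd

# Same 3x3 values as in the sum example
df = pd.DataFrame([[10, 18, 11], [13, 15, 8], [9, 20, 3]])

col_means = df.mean(axis="index")     # same as axis=0: one value per column
row_means = df.mean(axis="columns")   # same as axis=1: one value per row

print(col_means)
print(row_means)
```

Using the names `'index'` and `'columns'` instead of `0` and `1` can make the intent easier to read.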