# Pandas 101 (Part 2)

***Author:*** Leandro Ariza (ariza.leandro@gmail.com)

## 5. Data Manipulation

### 5.1. Adding and Deleting Columns

You can add columns to a DataFrame by assigning to a new name with square brackets `[]`, or with the `insert()` and `assign()` methods, and delete existing columns with the `drop()` method:

In [13]:
# Load the data used in the first part
import pandas as pd
import numpy as np

df_fortune = pd.read_csv("./data/fortune1000.csv", header="infer", sep=",")
df_fortune

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock)
3,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment"
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care
...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities


In [14]:
# Work on a copy so that the DataFrame loaded from disk remains unchanged.
# copy() returns an independent object, whereas a plain assignment
# (df_fortune_cp = df_fortune) only creates a second name for the same
# DataFrame, so changes made through either name would show up in both.
df_fortune_cp = df_fortune.copy()
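
The effect of `copy()` can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (the names `original` and `working` are not part of the Fortune data) and only illustrates that edits to the copy do not reach the original:

```python
import pandas as pd

# Hypothetical toy data, not the Fortune dataset
original = pd.DataFrame({"Company": ["A", "B"], "Profits": [10.0, -5.0]})

# copy() returns an independent DataFrame
working = original.copy()
working["Country"] = "USA"       # new column only in the copy
working.loc[0, "Profits"] = 0.0  # updated value only in the copy

print(original.columns.tolist())   # ['Company', 'Profits']
print(original.loc[0, "Profits"])  # 10.0
```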

In [15]:
# Adding a new column with a constant value
df_fortune_cp["Country"] = "USA"
df_fortune_cp

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Country
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,USA
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,USA
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),USA
3,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",USA
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,USA
...,...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,USA
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,USA
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,USA
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,USA


In [16]:
# Inserting a column at a given position (loc=0 places it first)
value_to_insert = "Fortune"
df_fortune_cp.insert(loc=0, column="Source", value=value_to_insert)
df_fortune_cp

Unnamed: 0,Source,Company,Revenues,Profits,Employees,Sector,Industry,Country
0,Fortune,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,USA
1,Fortune,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,USA
2,Fortune,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),USA
3,Fortune,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",USA
4,Fortune,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,USA
...,...,...,...,...,...,...,...,...
995,Fortune,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,USA
996,Fortune,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,USA
997,Fortune,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,USA
998,Fortune,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,USA


In [17]:
# Adding a new column based on another column
df_fortune_cp["Profits_neg"] = df_fortune_cp.Profits < 0
df_fortune_cp

Unnamed: 0,Source,Company,Revenues,Profits,Employees,Sector,Industry,Country,Profits_neg
0,Fortune,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,USA,False
1,Fortune,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,USA,False
2,Fortune,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),USA,False
3,Fortune,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",USA,False
4,Fortune,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,USA,False
...,...,...,...,...,...,...,...,...,...
995,Fortune,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,USA,False
996,Fortune,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,USA,False
997,Fortune,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,USA,False
998,Fortune,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,USA,False


In [18]:
# Adding a new column computed from two existing columns
df_fortune_cp['Expenses'] = df_fortune_cp['Revenues'] - df_fortune_cp['Profits']
df_fortune_cp

Unnamed: 0,Source,Company,Revenues,Profits,Employees,Sector,Industry,Country,Profits_neg,Expenses
0,Fortune,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,USA,False,490481.0
1,Fortune,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,USA,False,224653.0
2,Fortune,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),USA,False,197197.0
3,Fortune,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",USA,False,180883.0
4,Fortune,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,USA,False,190601.0
...,...,...,...,...,...,...,...,...,...,...
995,Fortune,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,USA,False,1807.4
996,Fortune,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,USA,False,1734.6
997,Fortune,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,USA,False,1698.8
998,Fortune,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,USA,False,1808.5


In [19]:
# Add a new column based on another column and
# update values using direct assignment
df_fortune_cp['Company_size'] = 'Small'
df_fortune_cp.loc[(df_fortune_cp.Employees > 1000) & (df_fortune_cp.Employees <= 10000), 'Company_size'] = 'Medium'
df_fortune_cp.loc[(df_fortune_cp.Employees > 10000) & (df_fortune_cp.Employees <= 100000), 'Company_size'] = 'Big'
df_fortune_cp.loc[(df_fortune_cp.Employees > 100000), 'Company_size'] = 'Huge'
df_fortune_cp

Unnamed: 0,Source,Company,Revenues,Profits,Employees,Sector,Industry,Country,Profits_neg,Expenses,Company_size
0,Fortune,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,USA,False,490481.0,Huge
1,Fortune,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,USA,False,224653.0,Big
2,Fortune,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),USA,False,197197.0,Huge
3,Fortune,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",USA,False,180883.0,Huge
4,Fortune,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,USA,False,190601.0,Huge
...,...,...,...,...,...,...,...,...,...,...,...
995,Fortune,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,USA,False,1807.4,Medium
996,Fortune,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,USA,False,1734.6,Big
997,Fortune,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,USA,False,1698.8,Medium
998,Fortune,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,USA,False,1808.5,Big
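
The same kind of binning can also be written with `pd.cut()`, which is a different technique from the `loc[]` assignments above. The sketch below is a minimal example on a hypothetical Series of employee counts. The bin edges mirror the thresholds used above, and the result is a categorical Series rather than a column of plain strings:

```python
import numpy as np
import pandas as pd

# Hypothetical employee counts, including values on the bin edges
employees = pd.Series([126, 1000, 1001, 10000, 71200, 100000, 2300000])

# Intervals are closed on the right by default: (1000, 10000] -> 'Medium', etc.
bins = [-np.inf, 1000, 10000, 100000, np.inf]
labels = ["Small", "Medium", "Big", "Huge"]

company_size = pd.cut(employees, bins=bins, labels=labels)
print(company_size.tolist())
# ['Small', 'Small', 'Medium', 'Medium', 'Big', 'Big', 'Huge']
```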


In [20]:
# Adding a new column using the assign() method
df_fortune_cp = df_fortune_cp.assign(Profit_per_employee = lambda x: x.Profits/x.Employees)
df_fortune_cp

Unnamed: 0,Source,Company,Revenues,Profits,Employees,Sector,Industry,Country,Profits_neg,Expenses,Company_size,Profit_per_employee
0,Fortune,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,USA,False,490481.0,Huge,0.004288
1,Fortune,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,USA,False,224653.0,Big,0.276826
2,Fortune,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),USA,False,197197.0,Huge,0.119204
3,Fortune,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",USA,False,180883.0,Huge,0.393098
4,Fortune,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,USA,False,190601.0,Huge,0.040608
...,...,...,...,...,...,...,...,...,...,...,...,...
995,Fortune,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,USA,False,1807.4,Medium,0.014902
996,Fortune,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,USA,False,1734.6,Big,0.010458
997,Fortune,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,USA,False,1698.8,Medium,0.025797
998,Fortune,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,USA,False,1808.5,Big,0.001901


In [21]:
# Deleting columns
df_fortune_cp = df_fortune_cp.drop(columns=["Source", "Country"])
df_fortune_cp

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,False,490481.0,Huge,0.004288
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,False,224653.0,Big,0.276826
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),False,197197.0,Huge,0.119204
3,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",False,180883.0,Huge,0.393098
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,False,190601.0,Huge,0.040608
...,...,...,...,...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,False,1807.4,Medium,0.014902
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,False,1734.6,Big,0.010458
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,False,1698.8,Medium,0.025797
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,False,1808.5,Big,0.001901


### 5.2. Updating Values

You can update values in a DataFrame or Series using direct assignment or the `replace()` method:

In [224]:
# Updating values using direct assignment based on loc[]
df_fortune_cp.loc[df_fortune_cp.Company == "Apple", "Company"] = "Apple Inc."
df_fortune_cp.query("Company == 'Apple Inc.'")

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee
3,Apple Inc.,229234.0,48351.0,123000,Technology,"Computers, Office Equipment",False,180883.0,Huge,0.393098


In [22]:
df_fortune_cp.Sector.value_counts()

Sector
Financials                       155
Energy                           107
Technology                       103
Retailing                         77
Health Care                       71
Business Services                 53
Industrials                       49
Materials                         45
Wholesalers                       44
Transportation                    40
Food, Beverages & Tobacco         37
Chemicals                         33
Household Products                28
Engineering & Construction        27
Hotels, Restaurants & Leisure     26
Media                             25
Aerospace & Defense               25
Motor Vehicles & Parts            19
Apparel                           14
Food &  Drug Stores               12
Telecommunications                10
Name: count, dtype: int64

In [24]:
# Counting the rows that belong to either of the two food-related sectors
df_fortune_cp.Sector.isin(['Food, Beverages & Tobacco', 'Food &  Drug Stores']).sum()

49

In [26]:
# Updating values using direct assignment based on loc[]
mask = (df_fortune_cp.Sector == 'Food, Beverages & Tobacco') | (df_fortune_cp.Sector == 'Food &  Drug Stores')
df_fortune_cp.loc[mask, 'Sector'] = 'Food, Beverages, Drug Stores & Tobacco'
df_fortune_cp.Sector.value_counts()

Sector
Financials                                155
Energy                                    107
Technology                                103
Retailing                                  77
Health Care                                71
Business Services                          53
Food, Beverages, Drug Stores & Tobacco     49
Industrials                                49
Materials                                  45
Wholesalers                                44
Transportation                             40
Chemicals                                  33
Household Products                         28
Engineering & Construction                 27
Hotels, Restaurants & Leisure              26
Media                                      25
Aerospace & Defense                        25
Motor Vehicles & Parts                     19
Apparel                                    14
Telecommunications                         10
Name: count, dtype: int64

In [28]:
# Replacing a single value (in a series) using replace()
df_fortune_cp["Sector"] = df_fortune_cp.Sector.replace(
    to_replace="Telecommunications",
    value="Telecom"
)
df_fortune_cp.Sector.value_counts()

Sector
Financials                                155
Energy                                    107
Technology                                103
Retailing                                  77
Health Care                                71
Business Services                          53
Food, Beverages, Drug Stores & Tobacco     49
Industrials                                49
Materials                                  45
Wholesalers                                44
Transportation                             40
Chemicals                                  33
Household Products                         28
Engineering & Construction                 27
Hotels, Restaurants & Leisure              26
Media                                      25
Aerospace & Defense                        25
Motor Vehicles & Parts                     19
Apparel                                    14
Telecom                                    10
Name: count, dtype: int64

In [29]:
# Replacing multiple values with a single value (in a series) using replace()
df_fortune_cp["Sector"] = df_fortune_cp.Sector.replace(
    to_replace=["Transportation", "Motor Vehicles & Parts"],
    value="Automotive & Transportation"
)
df_fortune_cp.Sector.value_counts()

Sector
Financials                                155
Energy                                    107
Technology                                103
Retailing                                  77
Health Care                                71
Automotive & Transportation                59
Business Services                          53
Food, Beverages, Drug Stores & Tobacco     49
Industrials                                49
Materials                                  45
Wholesalers                                44
Chemicals                                  33
Household Products                         28
Engineering & Construction                 27
Hotels, Restaurants & Leisure              26
Aerospace & Defense                        25
Media                                      25
Apparel                                    14
Telecom                                    10
Name: count, dtype: int64

In [30]:
# Replacing values with another set of values (in a series) using replace()
df_fortune_cp["Sector"] = df_fortune_cp.Sector.replace(
    to_replace={"Telecom": "Telecommunications", "Technology": "Tech"}
)
df_fortune_cp.Sector.value_counts()

Sector
Financials                                155
Energy                                    107
Tech                                      103
Retailing                                  77
Health Care                                71
Automotive & Transportation                59
Business Services                          53
Food, Beverages, Drug Stores & Tobacco     49
Industrials                                49
Materials                                  45
Wholesalers                                44
Chemicals                                  33
Household Products                         28
Engineering & Construction                 27
Hotels, Restaurants & Leisure              26
Aerospace & Defense                        25
Media                                      25
Apparel                                    14
Telecommunications                         10
Name: count, dtype: int64

### 5.3. Applying Functions to Data

You can apply custom functions or NumPy functions to data in a DataFrame or Series using the `apply()` method:

In [37]:
# Applying a function to a Series
df_fortune_cp['Profits_sq'] = df_fortune_cp['Profits'].apply(lambda x: x**2)
df_fortune_cp

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_sq
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,False,490481.0,Huge,0.004288,9.725904e+07
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,False,224653.0,Big,0.276826,3.884841e+08
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),False,197197.0,Huge,0.119204,2.019604e+09
3,Apple,229234.0,48351.0,123000,Tech,"Computers, Office Equipment",False,180883.0,Huge,0.393098,2.337819e+09
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,False,190601.0,Huge,0.040608,1.114714e+08
...,...,...,...,...,...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,False,1807.4,Medium,0.014902,2.981160e+03
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,False,1734.6,Big,0.010458,1.522756e+04
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,False,1698.8,Medium,0.025797,2.316484e+04
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,False,1808.5,Big,0.001901,1.640250e+03


In [40]:
# Applying a NumPy function to a Series
df_fortune_cp['Revenues_log'] = df_fortune_cp['Revenues'].apply(lambda x: np.log(x))
df_fortune_cp

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_sq,Revenues_log
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,False,490481.0,Huge,0.004288,9.725904e+07,13.123049
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,False,224653.0,Big,0.276826,3.884841e+08,12.406410
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),False,197197.0,Huge,0.119204,2.019604e+09,12.397259
3,Apple,229234.0,48351.0,123000,Tech,"Computers, Office Equipment",False,180883.0,Huge,0.393098,2.337819e+09,12.342499
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,False,190601.0,Huge,0.040608,1.114714e+08,12.211851
...,...,...,...,...,...,...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,False,1807.4,Medium,0.014902,2.981160e+03,7.529406
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,False,1734.6,Big,0.010458,1.522756e+04,7.527256
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,False,1698.8,Medium,0.025797,2.316484e+04,7.523481
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,False,1808.5,Big,0.001901,1.640250e+03,7.522400
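
For simple element-wise arithmetic, `apply()` is usually not required, because NumPy functions and operators work on a whole Series at once. The following sketch uses a hypothetical Series of revenues to show that both approaches give the same values:

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values
revenues = pd.Series([500343.0, 244363.0, 1849.0])

# Element by element, as in the cells above
log_via_apply = revenues.apply(lambda x: np.log(x))
sq_via_apply = revenues.apply(lambda x: x**2)

# Vectorized: the function/operator receives the whole Series
log_vectorized = np.log(revenues)
sq_vectorized = revenues**2

print(np.allclose(log_via_apply, log_vectorized))  # True
print(np.allclose(sq_via_apply, sq_vectorized))    # True
```

The vectorized form is typically shorter and faster on large columns, while `apply()` remains useful for custom Python functions that have no vectorized equivalent.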


## 6. Data Cleaning and Preprocessing

### 6.1. Handling Missing Data

Pandas provides methods for handling missing data, such as `dropna()` to remove rows or columns with missing values, and `fillna()` to fill missing values with specified values:

In [44]:
# First, count the missing values per column using isna() and sum()
df_fortune_cp.isna().sum()

Company                0
Revenues               0
Profits                2
Employees              0
Sector                 0
Industry               0
Profits_neg            0
Expenses               2
Company_size           0
Profit_per_employee    2
Profits_sq             2
Revenues_log           0
dtype: int64

In [232]:
# Fraction of missing values per column using isna() and mean()
# (multiply by 100 to get a percentage)
df_fortune_cp.isna().mean()

Company                0.000
Revenues               0.000
Profits                0.002
Employees              0.000
Sector                 0.000
Industry               0.000
Profits_neg            0.000
Expenses               0.002
Company_size           0.000
Profit_per_employee    0.002
Profits_sq             0.002
Revenues_log           0.000
dtype: float64

In [45]:
# Find the rows with missing values using isna() and boolean indexing
df_fortune_cp[df_fortune_cp.Profits.isna()]

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_sq,Revenues_log
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,False,490481.0,Huge,0.004288,9.725904e+07,13.123049
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,False,224653.0,Big,0.276826,3.884841e+08,12.406410
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),False,197197.0,Huge,0.119204,2.019604e+09,12.397259
3,Apple,229234.0,48351.0,123000,Tech,"Computers, Office Equipment",False,180883.0,Huge,0.393098,2.337819e+09,12.342499
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,False,190601.0,Huge,0.040608,1.114714e+08,12.211851
...,...,...,...,...,...,...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,False,1807.4,Medium,0.014902,2.981160e+03,7.529406
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,False,1734.6,Big,0.010458,1.522756e+04,7.527256
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,False,1698.8,Medium,0.025797,2.316484e+04,7.523481
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,False,1808.5,Big,0.001901,1.640250e+03,7.522400


In [46]:
# Remove the rows where Profits is missing
df_fortune_cp = df_fortune_cp.dropna(subset="Profits", how="any")
df_fortune_cp.isna().sum()

Company                0
Revenues               0
Profits                0
Employees              0
Sector                 0
Industry               0
Profits_neg            0
Expenses               0
Company_size           0
Profit_per_employee    0
Profits_sq             0
Revenues_log           0
dtype: int64

In [235]:
# Filling missing values with a specified value
# df_fortune_cp.fillna(-999)
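
Since the `fillna()` call above is commented out, here is a minimal sketch of how it behaves. The toy DataFrame and the choice of fill values (a sentinel and the column median) are assumptions made for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing profit
df = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Profits": [10.0, np.nan, 30.0],
})

# Fill with a constant, restricted to one column through a dict
filled_const = df.fillna({"Profits": -999})

# Fill with a statistic computed from the column itself
filled_median = df.assign(Profits=df["Profits"].fillna(df["Profits"].median()))

print(filled_const["Profits"].tolist())   # [10.0, -999.0, 30.0]
print(filled_median["Profits"].tolist())  # [10.0, 20.0, 30.0]
```

Both calls return new objects, so `df` still contains the missing value afterwards.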

### 6.2. Data Type Conversions

You can convert data types of columns using the `astype()` method:

In [47]:
# Converting data types of columns
# df_fortune_cp["Profits_neg"] = df_fortune_cp["Profits_neg"].astype(int)
df_fortune_cp.loc[:, "Profits_neg"] = df_fortune_cp["Profits_neg"].astype(int)
df_fortune_cp["Profits_neg"].value_counts()

Profits_neg
0    871
1    127
Name: count, dtype: int64
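
As a self-contained illustration, the sketch below converts a boolean column to integers on a hypothetical toy DataFrame. It also shows that `astype()` accepts a dictionary when only some columns should be converted:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Profits": [9862.0, -54.6, 123.4]})
df["Profits_neg"] = df["Profits"] < 0  # boolean column

# Convert a single column (Series.astype)
as_int = df["Profits_neg"].astype(int)
print(as_int.tolist())  # [0, 1, 0]

# Convert selected columns of a DataFrame with a {column: dtype} mapping
converted = df.astype({"Profits_neg": "int64"})
print(converted.dtypes)
```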

### 6.3. Renaming Columns

You can rename columns using the `rename()` method or by directly assigning to the columns attribute:

In [237]:
# Rename ALL columns by directly assigning to columns attribute
# df.columns = ['New_Name1', 'New_Name2', ...]
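
A minimal sketch of the direct assignment, using a hypothetical two-column DataFrame. The list must contain one name per column, in the existing order:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"Company": ["A"], "Profits Sq": [4.0]})

# Replace ALL column names at once
df.columns = ["company", "profits_sq"]
print(df.columns.tolist())  # ['company', 'profits_sq']

# New names can also be derived from the current ones
df.columns = df.columns.str.upper()
print(df.columns.tolist())  # ['COMPANY', 'PROFITS_SQ']
```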

In [48]:
# Renaming specific columns using rename()
df_fortune_cp = df_fortune_cp.rename(columns={'Profits_sq': 'Profits_squared'})
df_fortune_cp.columns

Index(['Company', 'Revenues', 'Profits', 'Employees', 'Sector', 'Industry',
       'Profits_neg', 'Expenses', 'Company_size', 'Profit_per_employee',
       'Profits_squared', 'Revenues_log'],
      dtype='object')

### 6.4. Finding and Dropping Duplicates

You can find duplicate rows using the `duplicated()` method and drop them using `drop_duplicates()`:

In [49]:
# For convenience, obtain a sample from the complete dataframe
df_tmp = df_fortune_cp.filter(items=["Company", "Sector", "Profits_neg"]).sample(n=10, random_state=42)
df_tmp

Unnamed: 0,Company,Sector,Profits_neg
453,Arthur J. Gallagher,Financials,0
794,Revlon,Household Products,1
209,Principal Financial,Financials,0
309,Eastman Chemical,Chemicals,0
741,Cracker Barrel Old Country Store,"Hotels, Restaurants & Leisure",0
579,CNO Financial Group,Financials,0
852,Primoris Services,Engineering & Construction,0
546,Xylem,Industrials,0
436,Alleghany,Financials,0
678,Carters,Apparel,0


In [52]:
# Finding ALL the duplicate rows
mask = df_tmp.duplicated(subset=["Sector", "Profits_neg"], keep=False)
df_tmp[mask]

Unnamed: 0,Company,Sector,Profits_neg
453,Arthur J. Gallagher,Financials,0
209,Principal Financial,Financials,0
579,CNO Financial Group,Financials,0
436,Alleghany,Financials,0


In [241]:
# Finding duplicate rows, ignore first occurrence
mask = df_tmp.duplicated(subset=["Sector", "Profits_neg"], keep="first")
df_tmp[mask]

Unnamed: 0,Company,Sector,Profits_neg
209,Principal Financial,Financials,0
579,CNO Financial Group,Financials,0
436,Alleghany,Financials,0


In [242]:
# Dropping ALL duplicate rows (no occurrence is kept)
df_tmp.drop_duplicates(subset=["Sector", "Profits_neg"], keep=False)

Unnamed: 0,Company,Sector,Profits_neg
794,Revlon,Household Products,1
309,Eastman Chemical,Chemicals,0
741,Cracker Barrel Old Country Store,"Hotels, Restaurants & Leisure",0
852,Primoris Services,Engineering & Construction,0
546,Xylem,Industrials,0
678,Carters,Apparel,0


In [54]:
# Dropping duplicate rows, keep first occurrence
df_tmp.drop_duplicates(subset=["Sector", "Profits_neg"], keep="first")

Unnamed: 0,Company,Sector,Profits_neg
453,Arthur J. Gallagher,Financials,0
794,Revlon,Household Products,1
309,Eastman Chemical,Chemicals,0
741,Cracker Barrel Old Country Store,"Hotels, Restaurants & Leisure",0
852,Primoris Services,Engineering & Construction,0
546,Xylem,Industrials,0
678,Carters,Apparel,0


## 7. Data Aggregation and Grouping

### 7.1. Grouping Data

Pandas' `groupby()` method allows you to group DataFrame rows based on one or more columns.

In [55]:
# Grouping data by a single column
grouped = df_fortune_cp.groupby("Sector")
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002768A318AD0>

In [56]:
# Grouping data by multiple columns
grouped = df_fortune_cp.groupby(["Sector", "Company_size"])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002768A0F6E90>
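
The output above is only a description of the grouping: no computation happens until you iterate over the groups or call an aggregation. The sketch below, on a hypothetical toy DataFrame, shows a few ways to look inside a `groupby` object:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Sector": ["Energy", "Tech", "Energy", "Tech"],
    "Profits": [10.0, 20.0, 30.0, 40.0],
})

grouped = df.groupby("Sector")
print(grouped.ngroups)  # 2

# Each iteration yields the group key and the sub-DataFrame for that key
for sector, group in grouped:
    print(sector, group["Company"].tolist())

# Retrieve a single group by its key
energy = grouped.get_group("Energy")
print(energy["Company"].tolist())  # ['A', 'C']
```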

### 7.2. Aggregating Data

Once the groups are defined, you can apply functions to them.
This process typically follows the 'split-apply-combine' pattern:

1. Splitting data into groups based on criteria.
2. Applying a function to each group independently.
3. Combining the results into a structured format.
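
The three steps above can be sketched by hand on a hypothetical toy DataFrame and compared with the single `groupby()` call that performs them together. The manual version is for illustration only:

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "Sector": ["Energy", "Tech", "Energy", "Tech"],
    "Profits": [10.0, 20.0, 30.0, 40.0],
})

# 1. Split: one sub-DataFrame per sector
pieces = {sector: group for sector, group in df.groupby("Sector")}

# 2. Apply: compute a value for each piece independently
totals = {sector: group["Profits"].sum() for sector, group in pieces.items()}

# 3. Combine: collect the results into a single Series
manual = pd.Series(totals, name="Profits").rename_axis("Sector")

# The same result in one expression
direct = df.groupby("Sector")["Profits"].sum()

print(manual.to_dict())  # {'Energy': 40.0, 'Tech': 60.0}
print(direct.to_dict())  # {'Energy': 40.0, 'Tech': 60.0}
```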

Common aggregation methods include `sum()`, `mean()`, `count()`, and `size()`.
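
Note that `size()` and `count()` are not interchangeable: `size()` counts all rows in each group, while `count()` counts only the non-missing values of a column. A minimal sketch on hypothetical data with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one Energy company has no reported profit
df = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Tech"],
    "Profits": [10.0, np.nan, 5.0],
})

grouped = df.groupby("Sector")

sizes = grouped.size()               # rows per group
counts = grouped["Profits"].count()  # non-missing Profits per group

print(sizes.to_dict())   # {'Energy': 2, 'Tech': 1}
print(counts.to_dict())  # {'Energy': 1, 'Tech': 1}
```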

In [57]:
# Counting the rows in each group with size()
grouped = df_fortune_cp.groupby("Sector")
grouped.size()

Sector
Aerospace & Defense                        25
Apparel                                    14
Automotive & Transportation                59
Business Services                          53
Chemicals                                  33
Energy                                    106
Engineering & Construction                 27
Financials                                155
Food, Beverages, Drug Stores & Tobacco     49
Health Care                                71
Hotels, Restaurants & Leisure              26
Household Products                         28
Industrials                                49
Materials                                  45
Media                                      25
Retailing                                  77
Tech                                      102
Telecommunications                         10
Wholesalers                                44
dtype: int64

In [60]:
# Counting rows per company size, then renaming and sorting the result
(
    df_fortune_cp
    .groupby('Company_size', as_index=False).size()
    .rename(columns={'size': 'Number'})
    .sort_values(by='Number', ascending=False)
)

Unnamed: 0,Company_size,Number
0,Big,520
2,Medium,397
1,Huge,60
3,Small,21


In [63]:
# Counting rows for each value of Profits_neg
(
    df_fortune_cp
    .groupby('Profits_neg', as_index=False)["Profits_neg"]
    .size()
)

Unnamed: 0,Profits_neg,size
0,0,871
1,1,127


In [249]:
# Summing a single column within each group
(
    df_fortune_cp
    .groupby('Company_size', as_index=False)["Profits"]
    .sum()
)

Unnamed: 0,Company_size,Profits
0,Big,558032.1
1,Huge,396645.7
2,Medium,145142.9
3,Small,9728.3


In [250]:
# Aggregating several columns at once
(
    df_fortune_cp
    .groupby('Company_size', as_index=False)[['Profits', 'Employees']]
    .mean()
    .sort_values(by='Profits', ascending=False)
)

Unnamed: 0,Company_size,Profits,Employees
1,Huge,6610.761667,254000.5
0,Big,1073.138654,31072.013462
3,Small,463.252381,621.0
2,Medium,365.599244,5607.549118


In [64]:
# Applying several aggregation functions to one column
(
    df_fortune_cp
    .groupby('Company_size', as_index=False)
    .agg({'Employees': ['min', 'median', 'max', 'count']})
)

Unnamed: 0_level_0,Company_size,Employees,Employees,Employees,Employees
Unnamed: 0_level_1,Unnamed: 1_level_1,min,median,max,count
0,Big,10015,22100.0,100000,520
1,Huge,102700,200500.0,2300000,60
2,Medium,1047,5600.0,10000,397
3,Small,126,593.0,981,21


In [65]:
# Applying a different aggregation function to each column
aggregations = {
    'Revenues': 'min',
    'Profits': 'max',
    'Employees': 'mean'
}
df_fortune_cp.groupby('Company_size', as_index=False).agg(aggregations)

Unnamed: 0,Company_size,Revenues,Profits,Employees
0,Big,1849.0,21308.0,31072.013462
1,Huge,2792.0,48351.0,254000.5
2,Medium,1851.0,10222.0,5607.549118
3,Small,1848.0,2625.0,621.0


In [253]:
# Named aggregation: choose the name of each output column
(
    df_fortune_cp
    .groupby('Company_size', as_index=False).agg(
        revenues_min = ('Revenues', 'min'),
        profits_max = ('Profits', 'max')
    )
)

Unnamed: 0,Company_size,revenues_min,profits_max
0,Big,1849.0,21308.0
1,Huge,2792.0,48351.0
2,Medium,1851.0,10222.0
3,Small,1848.0,2625.0


In [66]:
# Filtering to selected sectors before grouping
industry_list = ['Industrials', 'Chemicals']
(
    df_fortune_cp
    .query('Sector in @industry_list')
    .groupby('Sector', as_index=False)[['Profits', 'Employees']]
    .mean()
)

Unnamed: 0,Sector,Profits,Employees
0,Chemicals,620.454545,14364.242424
1,Industrials,390.087755,31961.428571


In [67]:
# Filtering to selected sectors, then grouping by two columns
industry_list = ['Industrials', 'Chemicals']
(
    df_fortune_cp
    .query('Sector in @industry_list')
    .groupby(['Sector', 'Profits_neg'], as_index=False)['Employees']
    .mean()
)

Unnamed: 0,Sector,Profits_neg,Employees
0,Chemicals,0,15412.758621
1,Chemicals,1,6762.5
2,Industrials,0,26052.391304
3,Industrials,1,122566.666667


In [68]:
# Selecting the most profitable company in each sector:
# sort by Profits first, then take the first row of each group
(
    df_fortune_cp
    .sort_values(by="Profits", ascending=False)
    .groupby('Sector', as_index=False)
    .first()
)

Unnamed: 0,Sector,Company,Revenues,Profits,Employees,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_squared,Revenues_log
0,Aerospace & Defense,Boeing,93392.0,8197.0,140800,Aerospace and Defense,0,85195.0,Huge,0.058217,67190810.0,11.444561
1,Apparel,Nike,34350.0,4240.0,74400,Apparel,0,30110.0,Big,0.056989,17977600.0,10.444357
2,Automotive & Transportation,Union Pacific,21240.0,10712.0,41992,Railroads,0,10528.0,Big,0.255096,114746900.0,9.963641
3,Business Services,Visa,18358.0,6699.0,15000,Financial Data Services,0,11659.0,Big,0.4466,44876600.0,9.817821
4,Chemicals,Air Products & Chemicals,8442.0,3000.4,15150,Chemicals,0,5441.6,Big,0.198046,9002400.0,9.040975
5,Energy,Exxon Mobil,244363.0,19710.0,71200,Petroleum Refining,0,224653.0,Big,0.276826,388484100.0,12.40641
6,Engineering & Construction,D.R. Horton,14091.0,1038.4,7735,Homebuilders,0,13052.6,Medium,0.134247,1078275.0,9.553292
7,Financials,Berkshire Hathaway,242137.0,44940.0,377000,Insurance: Property and Casualty (Stock),0,197197.0,Huge,0.119204,2019604000.0,12.397259
8,"Food, Beverages, Drug Stores & Tobacco",Kraft Heinz,26232.0,10999.0,39000,Food Consumer Products,0,15233.0,Big,0.282026,120978000.0,10.174735
9,Health Care,Pfizer,52546.0,21308.0,90200,Pharmaceuticals,0,31238.0,Big,0.236231,454030900.0,10.869444


In [69]:
# Applying a custom function to each group: the company with the largest
# absolute profit (or loss) for each sector and profit sign
industry_list = ['Industrials', 'Chemicals']
(
    df_fortune_cp
    .query('Sector in @industry_list')
    .assign(Profits_abs = lambda x: x.Profits.abs())
    .groupby(['Sector', 'Profits_neg'], as_index=False)
    .apply(lambda x: x.nlargest(1, 'Profits_abs').filter(items=['Sector', 'Profits_neg', 'Company', 'Profits_abs']))
)

Unnamed: 0,Unnamed: 1,Sector,Profits_neg,Company,Profits_abs
0,344,Chemicals,0,Air Products & Chemicals,3000.4
1,623,Chemicals,1,Platform Specialty Products,296.2
2,96,Industrials,0,3M,4858.0
3,17,Industrials,1,General Electric,5786.0


In [73]:
# Use this to check the previous results
# (for Profits_neg == 1 the largest absolute value is the most negative profit, hence nsmallest)
# df_fortune_cp.query("Sector == 'Chemicals' and Profits_neg == 1")[["Company","Profits"]].nsmallest(1, "Profits")
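
As an alternative to `groupby(...).apply(...)`, the row with the largest value per group can be located with `idxmax()`. This is only a sketch on hypothetical toy data (the frame `df` and its values are made up for illustration), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Sector": ["Chemicals", "Chemicals", "Industrials", "Industrials"],
    "Company": ["A", "B", "C", "D"],
    "Profits": [300.0, -500.0, 4000.0, -100.0],
})

# Index label of the largest absolute profit within each sector
top_idx = df["Profits"].abs().groupby(df["Sector"]).idxmax()

# Use those labels to pull the full rows from the original frame
top = df.loc[top_idx].reset_index(drop=True)
print(top)
```

This avoids calling a Python function once per group, which tends to matter on larger frames.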

## 8. Combining Data

### 8.1. Concatenating Data

Pandas' `concat()` function is used to concatenate two or more DataFrames along rows or columns:

In [74]:
# First, create a second DataFrame to concatenate with the one we already have
df_tmp = pd.DataFrame({
    "Company": ["Amazon", "Stark Industries"]
})
df_tmp

Unnamed: 0,Company
0,Amazon
1,Stark Industries


In [76]:
# Concatenating DataFrames along rows (default behaviour)
pd.concat([df_fortune_cp, df_tmp], axis="rows", ignore_index=True)

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_squared,Revenues_log
0,Walmart,500343.0,9862.0,2300000.0,Retailing,General Merchandisers,0.0,490481.0,Huge,0.004288,9.725904e+07,13.123049
1,Exxon Mobil,244363.0,19710.0,71200.0,Energy,Petroleum Refining,0.0,224653.0,Big,0.276826,3.884841e+08,12.406410
2,Berkshire Hathaway,242137.0,44940.0,377000.0,Financials,Insurance: Property and Casualty (Stock),0.0,197197.0,Huge,0.119204,2.019604e+09,12.397259
3,Apple,229234.0,48351.0,123000.0,Tech,"Computers, Office Equipment",0.0,180883.0,Huge,0.393098,2.337819e+09,12.342499
4,UnitedHealth Group,201159.0,10558.0,260000.0,Health Care,Health Care: Insurance and Managed Care,0.0,190601.0,Huge,0.040608,1.114714e+08,12.211851
...,...,...,...,...,...,...,...,...,...,...,...,...
995,CoreLogic,1851.0,152.2,5900.0,Business Services,Financial Data Services,0.0,1698.8,Medium,0.025797,2.316484e+04,7.523481
996,Ensign Group,1849.0,40.5,21301.0,Health Care,Health Care: Medical Facilities,0.0,1808.5,Big,0.001901,1.640250e+03,7.522400
997,HCP,1848.0,414.2,190.0,Financials,Real estate,0.0,1433.8,Small,2.180000,1.715616e+05,7.521859
998,Amazon,,,,,,,,,,,


In [261]:
# Concatenating DataFrames along columns
# pd.concat([df1, df2], axis="columns")
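
A minimal sketch of column-wise concatenation, using two hypothetical frames (`left` and `right` are made up for illustration). Rows are aligned on the index, so labels missing from one frame produce `NaN`:

```python
import pandas as pd

# Hypothetical frames; `right` has no row for index label 1
left = pd.DataFrame({"Company": ["A", "B", "C"]})
right = pd.DataFrame({"Revenue": [10, 20]}, index=[0, 2])

# Columns are placed side by side; rows are matched by index label
wide = pd.concat([left, right], axis="columns")
print(wide)
```

If the two frames should be matched on a column rather than on the index, `merge()` is usually the better fit.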

### 8.2. Merging Data

Pandas' `merge()` function is used to merge DataFrames based on a common column or index.

In [77]:
# For convenience, let's create another dataframe
df_tech = df_fortune_cp.query("Sector == 'Tech'").filter(items=["Company"]).copy()
df_tech["Online"] = "Yes"
df_tech

Unnamed: 0,Company,Online
3,Apple,Yes
21,Alphabet,Yes
29,Microsoft,Yes
33,IBM,Yes
34,Dell Technologies,Yes
...,...,...
969,Cadence Design Systems,Yes
970,Nuance Communications,Yes
973,ServiceNow,Yes
983,MKS Instruments,Yes


In [81]:
# Merging DataFrames based on common columns
pd.merge(
    left=df_fortune_cp,
    right=df_tech,
    on=["Company"],
    indicator=True,
    how="left"
)

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_squared,Revenues_log,Online,_merge
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers,0,490481.0,Huge,0.004288,9.725904e+07,13.123049,,left_only
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining,0,224653.0,Big,0.276826,3.884841e+08,12.406410,,left_only
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock),0,197197.0,Huge,0.119204,2.019604e+09,12.397259,,left_only
3,Apple,229234.0,48351.0,123000,Tech,"Computers, Office Equipment",0,180883.0,Huge,0.393098,2.337819e+09,12.342499,Yes,both
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,0,190601.0,Huge,0.040608,1.114714e+08,12.211851,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified,0,1807.4,Medium,0.014902,2.981160e+03,7.529406,,left_only
994,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services,0,1734.6,Big,0.010458,1.522756e+04,7.527256,,left_only
995,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services,0,1698.8,Medium,0.025797,2.316484e+04,7.523481,,left_only
996,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities,0,1808.5,Big,0.001901,1.640250e+03,7.522400,,left_only
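
To see how the `how` and `indicator` arguments interact, here is a small sketch on hypothetical data (`companies` and `flags` are made-up frames):

```python
import pandas as pd

# Hypothetical frames sharing the key column "Company"
companies = pd.DataFrame({"Company": ["A", "B", "C"], "Revenue": [10, 20, 30]})
flags = pd.DataFrame({"Company": ["B", "C", "D"], "Online": ["Yes", "Yes", "Yes"]})

# Left join: keep every row of `companies`; _merge records where each row came from
left = pd.merge(companies, flags, on="Company", how="left", indicator=True)

# Inner join: only keys present in both frames
inner = pd.merge(companies, flags, on="Company", how="inner")

# Outer join: keys present in either frame
outer = pd.merge(companies, flags, on="Company", how="outer", indicator=True)

# Rows of `companies` without a match in `flags`
unmatched = left.loc[left["_merge"] == "left_only", "Company"].tolist()
print(unmatched)
```

Filtering on the `_merge` column is a convenient way to audit which keys failed to match after a join.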


## 9. Reshaping Data

### 9.1. Pivoting data

Pandas' `pivot()` method is used to reshape data from long to wide format.

In [82]:
# For convenience, let's create a small DataFrame in long format
data = {
    'Date': ['2024-04-01', '2024-04-01', '2024-04-02', '2024-04-02', '2024-04-03', '2024-04-03'],
    'Company': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Revenue': [1000, 1500, 1200, 1800, 900, 1300],
    'Profit': [200, 300, 240, 360, 180, 260]
}
df_long = pd.DataFrame(data)
df_long

Unnamed: 0,Date,Company,Revenue,Profit
0,2024-04-01,A,1000,200
1,2024-04-01,B,1500,300
2,2024-04-02,A,1200,240
3,2024-04-02,B,1800,360
4,2024-04-03,A,900,180
5,2024-04-03,B,1300,260


In [84]:
# Reshaping data using pivot()
df_long.pivot(index='Date', columns='Company', values=['Revenue', 'Profit'])

Unnamed: 0_level_0,Revenue,Revenue,Profit,Profit
Company,A,B,A,B
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2024-04-01,1000,1500,200,300
2024-04-02,1200,1800,240,360
2024-04-03,900,1300,180,260
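
`pivot()` requires each index/column pair to appear only once. When the long data contains duplicates, `pivot_table()` can aggregate them instead. A minimal sketch with a hypothetical frame `df_dup`:

```python
import pandas as pd

# Hypothetical long data with a duplicated (Date, Company) pair
df_dup = pd.DataFrame({
    "Date": ["2024-04-01", "2024-04-01", "2024-04-01"],
    "Company": ["A", "A", "B"],
    "Revenue": [1000, 200, 1500],
})

# pivot() cannot decide which duplicate to keep and raises a ValueError
try:
    df_dup.pivot(index="Date", columns="Company", values="Revenue")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() resolves duplicates with an aggregation function
wide = df_dup.pivot_table(
    index="Date", columns="Company", values="Revenue", aggfunc="sum"
)
print(pivot_failed)
print(wide)
```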


### 9.2. Melting Data

Pandas' `melt()` method is used to unpivot a DataFrame from wide to long format.

In [85]:
# First, create an example dataframe to melt
df_to_melt = (
    df_fortune_cp
    .groupby('Company_size', as_index=False).agg(
        min = ('Profits', 'min'),
        max = ('Profits', 'max')
    )
)
df_to_melt

Unnamed: 0,Company_size,min,max
0,Big,-6084.0,21308.0
1,Huge,-6798.0,48351.0
2,Medium,-5723.0,10222.0
3,Small,-197.9,2625.0


In [86]:
# Melting data from wide to long format
pd.melt(
    df_to_melt,
    id_vars=['Company_size'], value_vars=['min', 'max'],
    var_name='Stat', value_name='Value'
)

Unnamed: 0,Company_size,Stat,Value
0,Big,min,-6084.0
1,Huge,min,-6798.0
2,Medium,min,-5723.0
3,Small,min,-197.9
4,Big,max,21308.0
5,Huge,max,48351.0
6,Medium,max,10222.0
7,Small,max,2625.0
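
Melting and pivoting are inverse operations. The sketch below, on a hypothetical frame `wide`, melts it to long format and then pivots it back:

```python
import pandas as pd

# Hypothetical wide data
wide = pd.DataFrame({
    "Company_size": ["Big", "Small"],
    "min": [-10.0, -1.0],
    "max": [50.0, 5.0],
})

# Wide -> long: one row per (Company_size, Stat) pair
long = wide.melt(id_vars="Company_size", var_name="Stat", value_name="Value")

# Long -> wide again: each Stat value becomes a column
restored = (
    long.pivot(index="Company_size", columns="Stat", values="Value")
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover "Stat" label on the columns
)
print(restored)
```

Note that the restored columns come back in alphabetical order (`max` before `min`), so column order is not guaranteed to survive the round trip.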


## 10. Handling Categorical Data

### 10.1. Encoding Categorical Data

Pandas provides a special data type called `category` for representing categorical variables.
Categorical data are variables that take on a limited, fixed number of possible values.

Pandas offers various methods for encoding categorical data, such as `astype('category')`, `Categorical()`, `cut()`, and `get_dummies()`:

In [87]:
# Converting a column to categorical data type using `astype('category')`
df_fortune_cp["Profits_neg"] = df_fortune_cp["Profits_neg"].astype('category')
df_fortune_cp.Profits_neg

0      0
1      0
2      0
3      0
4      0
      ..
995    0
996    0
997    0
998    0
999    0
Name: Profits_neg, Length: 998, dtype: category
Categories (2, int32): [0, 1]

In [88]:
# Converting a column to categorical data type using `Categorical()`
df_fortune_cp["Company_size"] = pd.Categorical(
    df_fortune_cp["Company_size"],
    categories=['Small', 'Medium', 'Big', 'Huge'], ordered=True
)
df_fortune_cp.Company_size

0        Huge
1         Big
2        Huge
3        Huge
4        Huge
        ...  
995    Medium
996       Big
997    Medium
998       Big
999     Small
Name: Company_size, Length: 998, dtype: category
Categories (4, object): ['Small' < 'Medium' < 'Big' < 'Huge']
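
Declaring the categories as ordered changes how sorting and comparisons behave. A small sketch on a hypothetical Series `sizes`:

```python
import pandas as pd

# Hypothetical values
sizes = pd.Series(["Huge", "Small", "Big", "Medium"])
sizes_cat = pd.Series(pd.Categorical(
    sizes, categories=["Small", "Medium", "Big", "Huge"], ordered=True
))

# Plain strings sort alphabetically
sorted_alpha = sizes.sort_values().tolist()

# Ordered categoricals sort by the declared category order
sorted_rank = sizes_cat.sort_values().tolist()

# Comparisons against a category are only meaningful when ordered=True
at_least_big = sizes_cat[sizes_cat >= "Big"].tolist()

print(sorted_alpha)
print(sorted_rank)
print(at_least_big)
```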

In [90]:
# Binning numerical data into categorical bins using `cut()`
df_fortune_cp['Expenses_binned'] = pd.cut(df_fortune_cp['Expenses'], bins=[0, 1000, 1000000])
df_fortune_cp.Expenses_binned

0      (1000, 1000000]
1      (1000, 1000000]
2      (1000, 1000000]
3      (1000, 1000000]
4      (1000, 1000000]
            ...       
995    (1000, 1000000]
996    (1000, 1000000]
997    (1000, 1000000]
998    (1000, 1000000]
999    (1000, 1000000]
Name: Expenses_binned, Length: 998, dtype: category
Categories (2, interval[int64, right]): [(0, 1000] < (1000, 1000000]]
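
By default the bins produced by `cut()` are closed on the right, and values outside every bin become `NaN`. The following sketch uses a hypothetical Series `expenses` and made-up labels to illustrate both points:

```python
import pandas as pd

# Hypothetical values, including one exactly on a bin edge and one out of range
expenses = pd.Series([500.0, 1000.0, 1000.5, 250000.0, 2000000.0])

# Right-closed bins (default): (0, 1000] and (1000, 1000000]
binned = pd.cut(expenses, bins=[0, 1000, 1000000], labels=["low", "high"])

# Left-closed bins: [0, 1000) and [1000, 1000000)
binned_left = pd.cut(
    expenses, bins=[0, 1000, 1000000], labels=["low", "high"], right=False
)

print(binned.tolist())
print(binned_left.tolist())
```

The value `1000.0` lands in `"low"` with the default setting and in `"high"` with `right=False`; the value above the last edge is left as missing in both cases.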

In [94]:
df_tmp = df_fortune_cp.loc[0:5, ["Sector"]]
df_tmp

Unnamed: 0,Sector
0,Retailing
1,Energy
2,Financials
3,Tech
4,Health Care
5,Wholesalers


In [93]:
# Creating dummy variables for categorical data
# (multiplying by 1.0 converts the indicator columns to floats)
df_tmp = df_fortune_cp.loc[0:5, ["Company", "Sector"]]
pd.get_dummies(df_tmp['Sector'])*1.0

Unnamed: 0,Energy,Financials,Health Care,Retailing,Tech,Wholesalers
0,0.0,0.0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0
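
`get_dummies()` can also encode a column in place within a DataFrame. This sketch, on a hypothetical frame `df`, uses the `columns`, `prefix`, `dtype`, and `drop_first` parameters:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Sector": ["Energy", "Tech", "Energy"],
})

# Encode only "Sector"; other columns are passed through unchanged
dummies = pd.get_dummies(df, columns=["Sector"], prefix="Sector", dtype=float)

# drop_first=True omits the first category to avoid redundant columns
dummies_reduced = pd.get_dummies(
    df, columns=["Sector"], prefix="Sector", dtype=float, drop_first=True
)

print(dummies)
print(dummies_reduced)
```

Passing `dtype=float` yields float indicator columns directly, which is one way to avoid a separate conversion step.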


## 11. Working with Text Data

### 11.1. Introduction to String Methods in Pandas

Pandas provides a wide range of string methods that can be applied to columns containing text data. These methods allow you to manipulate and extract information from text data efficiently.

In [273]:
# Accessing string methods with .str accessor
# df['Text_Column'].str.method_name()
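
As a concrete illustration of the pattern above, here is a minimal sketch on a hypothetical Series `text`. String methods can be chained, and missing values are passed through rather than raising an error:

```python
import numpy as np
import pandas as pd

# Hypothetical text data with stray whitespace and a missing value
text = pd.Series(["  Health Care ", "TECH", np.nan])

# Each .str call returns a new Series, so calls can be chained
cleaned = text.str.strip().str.lower()
print(cleaned)
```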

### 11.2. Commonly Used String Methods

Here are some commonly used string methods in Pandas:

`str.lower()` / `str.upper()`: Converts strings to lowercase or uppercase.

In [95]:
# Converting text to lowercase
df_fortune_cp['Sector'].str.lower()

0              retailing
1                 energy
2             financials
3                   tech
4            health care
             ...        
995          wholesalers
996          health care
997    business services
998          health care
999           financials
Name: Sector, Length: 998, dtype: object

In [96]:
# Converting text to uppercase
df_fortune_cp['Sector'].str.upper()

0              RETAILING
1                 ENERGY
2             FINANCIALS
3                   TECH
4            HEALTH CARE
             ...        
995          WHOLESALERS
996          HEALTH CARE
997    BUSINESS SERVICES
998          HEALTH CARE
999           FINANCIALS
Name: Sector, Length: 998, dtype: object

`str.capitalize()` / `str.title()`: Capitalizes the first character of the string or the first character of each word, respectively. Both lowercase the remaining characters.

In [97]:
# Capitalizing the first character of each word
df_fortune_cp['Industry'].str.title()

0                         General Merchandisers
1                            Petroleum Refining
2      Insurance: Property And Casualty (Stock)
3                   Computers, Office Equipment
4       Health Care: Insurance And Managed Care
                         ...                   
995                    Wholesalers: Diversified
996    Health Care: Pharmacy And Other Services
997                     Financial Data Services
998             Health Care: Medical Facilities
999                                 Real Estate
Name: Industry, Length: 998, dtype: object

In [98]:
# Capitalizing only the first character of each string
df_fortune_cp['Industry'].str.capitalize()

0                         General merchandisers
1                            Petroleum refining
2      Insurance: property and casualty (stock)
3                   Computers, office equipment
4       Health care: insurance and managed care
                         ...                   
995                    Wholesalers: diversified
996    Health care: pharmacy and other services
997                     Financial data services
998             Health care: medical facilities
999                                 Real estate
Name: Industry, Length: 998, dtype: object

`str.strip()`: Removes leading and trailing whitespace from strings.

In [99]:
# Removing leading and trailing whitespace
df_fortune_cp['Industry'].str.strip()

0                         General Merchandisers
1                            Petroleum Refining
2      Insurance: Property and Casualty (Stock)
3                   Computers, Office Equipment
4       Health Care: Insurance and Managed Care
                         ...                   
995                    Wholesalers: Diversified
996    Health Care: Pharmacy and Other Services
997                     Financial Data Services
998             Health Care: Medical Facilities
999                                 Real estate
Name: Industry, Length: 998, dtype: object

`str.contains()`: Checks if each string contains a substring or pattern.

In [114]:
# Checking if a substring is present in text
# (commonly used for boolean indexing)
df_fortune_cp[df_fortune_cp['Industry'].str.contains(r'Internet|Data', case=False, na=False)].head()

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_squared,Revenues_log,Expenses_binned
7,Amazon.com,177866.0,3033.0,566000,Retailing,Internet Services and Retailing,0,174833.0,Huge,0.005359,9199089.0,12.088786,"(1000, 1000000]"
21,Alphabet,110855.0,12662.0,80110,Tech,Internet Services and Retailing,0,98193.0,Big,0.158058,160326244.0,11.615978,"(1000, 1000000]"
75,Facebook,40653.0,15934.0,25105,Tech,Internet Services and Retailing,0,24719.0,Big,0.634694,253892356.0,10.612828,"(1000, 1000000]"
160,Visa,18358.0,6699.0,15000,Business Services,Financial Data Services,0,11659.0,Big,0.4466,44876601.0,9.817821,"(1000, 1000000]"
221,PayPal Holdings,13094.0,1795.0,18700,Business Services,Financial Data Services,0,11299.0,Big,0.095989,3222025.0,9.479909,"(1000, 1000000]"
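
Two details of `str.contains()` are easy to overlook: the pattern is treated as a regular expression by default, and missing values need explicit handling when the result is used as a mask. A small sketch on a hypothetical Series `industry`:

```python
import numpy as np
import pandas as pd

# Hypothetical industry names, including a missing value
industry = pd.Series([
    "Internet Services",
    "Financial Data Services",
    "Real estate",
    np.nan,
    "Chemicals (Specialty)",
])

# Regex alternation, case-insensitive; na=False makes missing values count as no match
mask_regex = industry.str.contains(r"internet|data", case=False, na=False)

# regex=False matches the text literally, so parentheses need no escaping
mask_literal = industry.str.contains("(Specialty)", regex=False, na=False)

print(industry[mask_regex].tolist())
print(industry[mask_literal].tolist())
```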


`str.startswith(prefix)` / `str.endswith(suffix)` : Checks if each string starts/ends with the specified prefix/suffix.

In [104]:
# Checking if each string starts with a prefix
# (commonly used for boolean indexing)
df_fortune_cp[df_fortune_cp['Industry'].str.startswith('Health')].head()

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_squared,Revenues_log,Expenses_binned
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,0,190601.0,Huge,0.040608,111471400.0,12.211851,"(1000, 1000000]"
6,CVS Health,184765.0,6622.0,203000,Health Care,Health Care: Pharmacy and Other Services,0,178143.0,Huge,0.032621,43850880.0,12.12684,"(1000, 1000000]"
24,Express Scripts Holding,100065.0,4517.4,26600,Health Care,Health Care: Pharmacy and Other Services,0,95547.6,Big,0.169827,20406900.0,11.513575,"(1000, 1000000]"
28,Anthem,90040.0,3842.8,56000,Health Care,Health Care: Insurance and Managed Care,0,86197.2,Big,0.068621,14767110.0,11.408009,"(1000, 1000000]"
48,Aetna,60535.0,1904.0,47950,Health Care,Health Care: Insurance and Managed Care,0,58631.0,Big,0.039708,3625216.0,11.010977,"(1000, 1000000]"


In [106]:
# Checking if each string ends with a suffix
# (commonly used for boolean indexing)
df_fortune_cp[df_fortune_cp['Industry'].str.endswith('Care')].head()

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry,Profits_neg,Expenses,Company_size,Profit_per_employee,Profits_squared,Revenues_log,Expenses_binned
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care,0,190601.0,Huge,0.040608,111471400.0,12.211851,"(1000, 1000000]"
5,McKesson,198533.0,5070.0,64500,Wholesalers,Wholesalers: Health Care,0,193463.0,Big,0.078605,25704900.0,12.198711,"(1000, 1000000]"
11,AmerisourceBergen,153144.0,364.5,19500,Wholesalers,Wholesalers: Health Care,0,152779.5,Big,0.018692,132860.2,11.939134,"(1000, 1000000]"
13,Cardinal Health,129976.0,1288.0,40400,Wholesalers,Wholesalers: Health Care,0,128688.0,Big,0.031881,1658944.0,11.775105,"(1000, 1000000]"
28,Anthem,90040.0,3842.8,56000,Health Care,Health Care: Insurance and Managed Care,0,86197.2,Big,0.068621,14767110.0,11.408009,"(1000, 1000000]"


`str.replace()`: Replaces occurrences of a pattern with another string.

In [108]:
# Replacing spaces with underscores
df_fortune_cp['Industry'].str.replace(' ', '_')

0                         General_Merchandisers
1                            Petroleum_Refining
2      Insurance:_Property_and_Casualty_(Stock)
3                   Computers,_Office_Equipment
4       Health_Care:_Insurance_and_Managed_Care
                         ...                   
995                    Wholesalers:_Diversified
996    Health_Care:_Pharmacy_and_Other_Services
997                     Financial_Data_Services
998             Health_Care:_Medical_Facilities
999                                 Real_estate
Name: Industry, Length: 998, dtype: object

In [109]:
# Replacing ':' with ' - ' (the space after the colon is kept, hence the double space in the output)
df_fortune_cp['Industry'].str.replace(':', ' - ')

0                           General Merchandisers
1                              Petroleum Refining
2      Insurance -  Property and Casualty (Stock)
3                     Computers, Office Equipment
4       Health Care -  Insurance and Managed Care
                          ...                    
995                    Wholesalers -  Diversified
996    Health Care -  Pharmacy and Other Services
997                       Financial Data Services
998             Health Care -  Medical Facilities
999                                   Real estate
Name: Industry, Length: 998, dtype: object

In [111]:
# Removing everything up to and including the colon with a regular expression
df_fortune_cp['Industry'].str.replace(r'^.+:', '', regex=True)

0               General Merchandisers
1                  Petroleum Refining
2       Property and Casualty (Stock)
3         Computers, Office Equipment
4          Insurance and Managed Care
                    ...              
995                       Diversified
996       Pharmacy and Other Services
997           Financial Data Services
998                Medical Facilities
999                       Real estate
Name: Industry, Length: 998, dtype: object
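
The regular expression above leaves the space that follows the colon in place. A slightly extended pattern removes it as well; the sketch below uses a hypothetical Series `industry` and also shows a literal replacement with `regex=False`:

```python
import pandas as pd

# Hypothetical industry names
industry = pd.Series([
    "Health Care: Medical Facilities",
    "Insurance: Property and Casualty (Stock)",
    "Real estate",
])

# Drop the prefix, the colon, and any whitespace after it
sub_industry = industry.str.replace(r"^.+:\s*", "", regex=True)

# Literal replacement: ": " is matched as plain text, not as a pattern
dashed = industry.str.replace(": ", " - ", regex=False)

print(sub_industry.tolist())
print(dashed.tolist())
```

Strings without a colon are returned unchanged in both cases.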