# Pandas basics
---

`NumPy` provides fundamental structures and tools that make working with data easier, but several things limit its usefulness as a single tool when working with data:

* The lack of support for column names forces us to frame the questions we want to answer as multi-dimensional array operations.
* Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
* There are lots of low-level methods, but many common analysis patterns don't have pre-built methods.

The `pandas` library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an extension of NumPy.
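As a rough illustration of the points above, the sketch below builds a tiny dataframe by hand. The company names and figures are made up for the example (they are not taken from `f500.csv`). It shows that a NumPy array falls back to a single dtype, while each dataframe column keeps its own dtype and can be selected by label.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype: mixing numbers and text turns everything into strings.
mixed_array = np.array([["Acme", 120], ["Globex", 95]])
print(mixed_array.dtype.kind)  # "U" means unicode string

# A DataFrame keeps a separate dtype per column and labels both axes.
toy = pd.DataFrame(
    {"revenues": [120, 95], "country": ["USA", "Japan"]},
    index=["Acme", "Globex"],  # hypothetical company names
)
print(toy.dtypes)
print(toy.loc["Globex", "country"])  # selection by row and column label
```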

In [1]:
import pandas as pd

With pandas we'll inspect the `f500.csv` dataset. Pandas reads CSV files with the [`pandas.read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. Like NumPy's ndarrays, dataframes have a `.shape` attribute.

In [2]:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None # optional: removes the index name (the header of the first column) from the table
f500_type = type(f500)
f500_shape = f500.shape

In [3]:
print(f500_type, f500_shape)

<class 'pandas.core.frame.DataFrame'> (500, 16)


[`pandas.DataFrame()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame) objects, or just dataframes, are the primary pandas data structure (the second one is the series). Dataframes have two dimensions (rows and columns).

We can use the [`DataFrame.dtypes`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes) attribute (similar to NumPy's [`ndarray.dtype`](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.dtype.html#numpy.ndarray.dtype) attribute) to return information about the types of each column. Let's see what this would return for our selection of data above:

In [4]:
print(f500.dtypes)

rank                          int64
revenues                      int64
revenue_change              float64
profits                     float64
assets                        int64
profit_change               float64
ceo                          object
industry                     object
sector                       object
previous_rank                 int64
country                      object
hq_location                  object
website                      object
years_on_global_500_list      int64
employees                     int64
total_stockholder_equity      int64
dtype: object


A few handy methods we can use to get some high-level information about our dataframe:

* If we wanted to view the first few rows of our dataframe, we can use the [`DataFrame.head()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method, which returns the first five rows of our dataframe. The `DataFrame.head()` method also accepts an optional integer parameter which specifies the number of rows. We could use `f500.head(10)` to return the first 10 rows of our f500 dataframe.
* Similar in function to `DataFrame.head()`, we can use the [`DataFrame.tail()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) method to show us the last rows of our dataframe. The `DataFrame.tail()` method accepts an optional integer parameter to specify the number of rows, defaulting to five.
* If we wanted to get an overview of all the dtypes used in our dataframe, along with its shape and some extra information, we could use the [`DataFrame.info()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info) method. Note that `DataFrame.info()` prints the information, rather than returning it, so we can't assign it to a variable.

Let's practice using these three new methods:

In [5]:
f500_head = f500.head(6)
f500_tail = f500.tail(8)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 66.4+ KB


In [6]:
f500_head

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210
Volkswagen,6,240264,1.5,5937.3,432116,,Matthias Muller,Motor Vehicles and Parts,Motor Vehicles & Parts,7,Germany,"Wolfsburg, Germany",http://www.volkswagen.com,23,626715,97753


In [7]:
f500_tail

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Telecom Italia,493,21941,-17.4,1999.4,74295,,Flavio Cattaneo,Telecommunications,Telecommunications,404,Italy,"Milan, Italy",http://www.telecomitalia.com,18,61227,22366
Xiamen ITG Holding Group,494,21930,34.3,35.6,12161,-25.1,Xu Xiaoxi,Trading,Wholesalers,0,China,"Xiamen, China",http://www.itgholding.com.cn,1,18454,1066
Xinjiang Guanghui Industry Investment,495,21919,31.1,251.8,31957,49.9,Shang Jiqiang,Trading,Wholesalers,0,China,"Urumqi, China",http://www.guanghui.com,1,65616,4563
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310
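The defaults of `head()` and `tail()`, and the fact that `info()` prints rather than returns, can be checked on a small stand-in dataframe. The ten-row frame below is made up for the sketch, and the name `info_result` is only illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"rank": range(1, 11)})  # ten made-up rows

print(len(toy.head()))    # 5 rows by default
print(len(toy.tail(3)))   # 3 rows when a number is passed

# info() prints its summary and returns None, so there is nothing useful to store.
info_result = toy.info()
print(info_result is None)  # True
```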


Because our axes in pandas have labels, we can select data using those labels, unlike in NumPy where we needed to know the exact index location. To do this, we use the [`DataFrame.loc[]`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc) method.

The syntax for the `DataFrame.loc[]` method is:

```python
df.loc[row, column]
```

Where row and column refer to row and column labels respectively, and can be one of:

* A single label
* A list or array of labels
* A slice object with labels
* A boolean array

Here's an example of using this method:

In [8]:
industries = f500.loc[:,"industry"]
previous = f500.loc[:,["rank","previous_rank","years_on_global_500_list"]]
financial_data = f500.loc[:,"revenues":"profit_change"]

In [9]:
industries.head()

Walmart                        General Merchandisers
State Grid                                 Utilities
Sinopec Group                     Petroleum Refining
China National Petroleum          Petroleum Refining
Toyota Motor                Motor Vehicles and Parts
Name: industry, dtype: object

In [10]:
previous.head()

Unnamed: 0,rank,previous_rank,years_on_global_500_list
Walmart,1,1,23
State Grid,2,2,17
Sinopec Group,3,4,19
China National Petroleum,4,3,17
Toyota Motor,5,8,23


In [11]:
financial_data.head()

Unnamed: 0,revenues,revenue_change,profits,assets,profit_change
Walmart,485873,0.8,13643.0,198825,-7.2
State Grid,315199,-4.4,9571.3,489838,-6.2
Sinopec Group,267518,-9.1,1257.9,310726,-65.0
China National Petroleum,262573,-12.3,1867.5,585619,-73.7
Toyota Motor,254694,7.7,16899.3,437575,-12.3
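The cells above cover a single label, a list of labels, and a label slice. A boolean array works as a selector too. Here is a minimal sketch of all four selector types on a made-up dataframe (the row labels and values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"rank": [1, 2, 3], "country": ["USA", "China", "China"], "profits": [13.6, 9.6, 1.3]},
    index=["A Corp", "B Corp", "C Corp"],  # hypothetical row labels
)

single = toy.loc["A Corp", "country"]                   # single labels -> one value
listed = toy.loc[["A Corp", "C Corp"], ["rank"]]        # lists of labels -> dataframe
sliced = toy.loc["A Corp":"B Corp", "rank":"country"]   # label slices include both ends
masked = toy.loc[[False, True, True], "profits"]        # boolean array keeps the True rows

print(single)
print(list(sliced.index), list(sliced.columns))
print(list(masked.index))
```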


Instead of `df.loc[:,"col1"]` you can use `df["col1"]` to select columns. This works for single columns and lists of columns but not for column slices.

Instead of `df.loc[:,"col1"]` you can use `df.col1`. This shortcut does not work for labels that contain spaces or special characters.

Here are some examples:

In [12]:
countries = f500["country"]
revenues_years = f500[["revenues", "years_on_global_500_list"]]
ceo_to_sector = f500.loc[:,"ceo":"sector"]
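A minimal sketch of the shortcuts on a made-up dataframe. The column `hq location` (with a space) is an assumption added to show where the attribute shortcut stops working.

```python
import pandas as pd

toy = pd.DataFrame({"country": ["USA", "Japan"], "hq location": ["Austin", "Tokyo"]})

by_loc = toy.loc[:, "country"]
by_brackets = toy["country"]
by_attribute = toy.country
print(by_loc.equals(by_brackets) and by_loc.equals(by_attribute))  # True

# "hq location" contains a space, so `toy.hq location` would be a syntax error;
# square brackets (or .loc) still work.
print(toy["hq location"].tolist())

# Column slices are only available through .loc.
both = toy.loc[:, "country":"hq location"]
print(both.shape)
```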

**Series** is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series, and anytime you see a 2D pandas object, it will be a dataframe.

A dataframe can be seen as a collection of series objects, which is similar to how pandas stores the data behind the scenes. A single row or a single column of a dataframe is a series. Series objects also have an index, so we can use the `.loc[]` method with them as well (though using just square brackets is often more convenient). Here's an example:

In [13]:
ceos = f500["ceo"]
walmart = ceos["Walmart"]
apple_to_samsung = ceos["Apple":"Samsung Electronics"]
oil_companies = ceos[["Exxon Mobil", "BP", "Chevron"]]

Let's see the type of a row and a column from our dataset:

In [14]:
a_row = f500.loc["Walmart"]
a_column = f500.loc[:, "revenues"]
print(type(a_row))
print(type(a_column))
print(a_row)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
rank                                             1
revenues                                    485873
revenue_change                                 0.8
profits                                      13643
assets                                      198825
profit_change                                 -7.2
ceo                            C. Douglas McMillon
industry                     General Merchandisers
sector                                   Retailing
previous_rank                                    1
country                                        USA
hq_location                        Bentonville, AR
website                     http://www.walmart.com
years_on_global_500_list                        23
employees                                  2300000
total_stockholder_equity                     77798
Name: Walmart, dtype: object


We can see that `a_row` is a series object and that its dtype is `object`. A series has a single dtype, and because this row mixes integers, floats and strings, pandas falls back to the generic `object` dtype. The values themselves are not converted to strings. Let's use all of these methods and try to extract some rows, columns and combinations of them:

In [15]:
drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"], :]
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank", "previous_rank"]]
middle_companies = f500.loc["Tata Motors":"Nationwide", "rank":"country"]
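What an `object`-dtype row actually holds can be checked on a small made-up dataframe (names and values below are hypothetical). The sketch shows that the numbers in such a row are still numbers.

```python
import pandas as pd

toy = pd.DataFrame(
    {"rank": [1, 2], "revenue_change": [0.8, -4.4], "country": ["USA", "China"]},
    index=["A Corp", "B Corp"],
)

row = toy.loc["A Corp"]
print(row.dtype)                        # object: the row mixes several types
print(isinstance(row["country"], str))  # True
print(row["rank"] + 1)                  # 2, arithmetic still works on the stored number
```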

How can we quickly analyze a dataset? First of all, we can use the [`Series.describe()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.describe.html) method, which returns some descriptive statistics on the data contained within a specific pandas series. Let's look at an example:

In [16]:
revs = f500["revenues"]
revs.describe()

count       500.000000
mean      55416.358000
std       45725.478963
min       21609.000000
25%       29003.000000
50%       40236.000000
75%       63926.750000
max      485873.000000
Name: revenues, dtype: float64

The method tells us how many non-null values are contained in the series, the mean and [standard deviation](https://www.quora.com/What-is-standard-deviation-1), along with the minimum, maximum and [quartile](https://en.wikipedia.org/wiki/Quartile) values.

We can also skip the variable assignment and call the `.describe()` method directly on the result of a `.loc[]` (or square bracket) selection:

```python
f500["revenues"].describe()
```

In [17]:
f500["assets"].describe()

count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64

Because the values for this column are too long to display neatly, pandas has displayed them in **E-notation**, a type of scientific notation. 

Basically, in `N.nnnnnne+PP`:

* `N` is the digit before the decimal point
* `n` stands for the digits after the decimal point
* `e` means *times ten to the power of*
* `+PP` is the *power* of ten

So `2.436323e+05` is the same as `2.436323 * 10 ** 5`.
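Python reads and writes E-notation directly, so the equivalence can be checked in a few lines. This is only a small sanity check using the mean of the `assets` column shown above.

```python
import math

value = float("2.436323e+05")          # Python parses E-notation directly
print(value)                           # 243632.3
print(math.isclose(value, 2.436323 * 10 ** 5))
print(format(243632.3, ".6e"))         # 2.436323e+05
```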

We can also use the `.describe()` method on a series of strings, but the result will be different:

In [18]:
f500["country"].describe()

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object

DataFrame objects also have a [`DataFrame.describe()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) method that returns these same statistics for every column. One difference is that you need to ask explicitly for the statistics of the non-numeric columns: by default, `DataFrame.describe()` returns statistics for numeric columns only. If we want just the object columns, we need to pass the `include=['O']` parameter to the dataframe version of describe:

In [19]:
f500.describe(include=["O"])

Unnamed: 0,ceo,industry,sector,country,hq_location,website
count,500,500,500,500,500,500
unique,500,58,21,34,235,500
top,Gao Hongwei,Banks: Commercial and Savings,Financials,USA,"Beijing, China",http://www.lyb.com
freq,1,51,118,132,56,1


In [20]:
f500.describe()

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,previous_rank,years_on_global_500_list,employees,total_stockholder_equity
count,500.0,500.0,498.0,499.0,500.0,436.0,500.0,500.0,500.0,500.0
mean,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,222.134,15.036,133998.3,30628.076
std,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,146.941961,7.932752,170087.8,43642.576833
min,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,0.0,1.0,328.0,-59909.0
25%,125.75,29003.0,-5.9,556.95,36588.5,-22.775,92.75,7.0,42932.5,7553.75
50%,250.5,40236.0,0.55,1761.6,73261.5,-0.35,219.5,17.0,92910.5,15809.5
75%,375.25,63926.75,6.975,3954.0,180564.0,17.7,347.25,23.0,168917.2,37828.5
max,500.0,485873.0,442.3,45687.0,3473238.0,8909.5,500.0,23.0,2300000.0,301893.0


In [21]:
f500.describe(include="all")

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
count,500.0,500.0,498.0,499.0,500.0,436.0,500,500,500,500.0,500,500,500,500.0,500.0,500.0
unique,,,,,,,500,58,21,,34,235,500,,,
top,,,,,,,Gao Hongwei,Banks: Commercial and Savings,Financials,,USA,"Beijing, China",http://www.lyb.com,,,
freq,,,,,,,1,51,118,,132,56,1,,,
mean,250.5,55416.358,4.538353,3055.203206,243632.3,24.152752,,,,222.134,,,,15.036,133998.3,30628.076
std,144.481833,45725.478963,28.549067,5171.981071,485193.7,437.509566,,,,146.941961,,,,7.932752,170087.8,43642.576833
min,1.0,21609.0,-67.3,-13038.0,3717.0,-793.7,,,,0.0,,,,1.0,328.0,-59909.0
25%,125.75,29003.0,-5.9,556.95,36588.5,-22.775,,,,92.75,,,,7.0,42932.5,7553.75
50%,250.5,40236.0,0.55,1761.6,73261.5,-0.35,,,,219.5,,,,17.0,92910.5,15809.5
75%,375.25,63926.75,6.975,3954.0,180564.0,17.7,,,,347.25,,,,23.0,168917.2,37828.5


One more thing: the `DataFrame.describe()` method returns a DataFrame object, while the `Series.describe()` method returns a Series object.

Here are some examples of implementing the methods:

In [22]:
profits_desc = f500["profits"].describe()
revenue_and_employees_desc = f500[["revenues", "employees"]].describe()
all_desc = f500.describe(include="all")
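The return types described above can be confirmed on a small made-up dataframe. The values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "revenues": [10, 20, 30, 40],
    "country": ["USA", "USA", "China", "Japan"],
})

series_desc = toy["revenues"].describe()
frame_desc = toy.describe()               # numeric columns only by default
full_desc = toy.describe(include="all")   # numeric and non-numeric together

print(type(series_desc).__name__, type(frame_desc).__name__)
print(list(frame_desc.columns))
print(series_desc["mean"], full_desc.loc["top", "country"])
```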

Because pandas builds heavily on NumPy, pandas objects have many similar methods (like `.max()`, `.min()`, `.mean()`, `.median()`, `.mode()`, `.sum()`, etc.). Vectorized operations work with series as well.

[`Series.value_counts()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) is a useful method, which displays each unique non-null value from a series, with a count of the number of times that value is used:

In [23]:
top5_countries = f500["country"].value_counts().head()
top5_previous_rank = f500["previous_rank"].value_counts().head()
max_f500 = f500.max()
similar_max_f500 = f500.max(axis = 0, numeric_only = True) # column-wise maximum, restricted to the numeric columns
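A minimal sketch of `value_counts()` on a hand-made series (the values are invented for the example). It shows the descending order of the result and how null values are handled.

```python
import pandas as pd

countries = pd.Series(["USA", "China", "USA", None, "Japan", "USA", "China"])

counts = countries.value_counts()                  # nulls are left out by default
with_nulls = countries.value_counts(dropna=False)  # count the null value too

print(counts.index[0], counts.iloc[0])   # the most frequent value comes first
print(counts.sum(), with_nulls.sum())
```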

Assigning values in pandas works almost the same as in NumPy. Here are some practical examples:

In [24]:
f500.loc[:, "revenues_b"] = f500.loc[:, "revenues"] / 1000 # adding a new column
f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"

In [25]:
kr_bool = f500["country"] == "South Korea" # creates a boolean Series object
top_5_kr = f500[kr_bool].head() # shows top 5 Korean companies
top_5_kr_ranks = f500.loc[kr_bool, "rank"].head() # shows ranks of these companies
top_5_kr_ranks = top_5_kr["rank"] # same result, reusing the already filtered dataframe
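The assignment patterns above can be tried on a small made-up dataframe. The company names, CEO names and the replacement value `"Korea"` are hypothetical and only serve the sketch.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "revenues": [485873, 173957, 80436],
        "country": ["USA", "South Korea", "South Korea"],
        "ceo": ["A", "B", "C"],
    },
    index=["X Corp", "Y Corp", "Z Corp"],  # hypothetical companies
)

toy["revenues_b"] = toy["revenues"] / 1000   # new column computed from an existing one
toy.loc["Y Corp", "ceo"] = "New Name"        # a single value, addressed by labels

kr_bool = toy["country"] == "South Korea"    # boolean Series
toy.loc[kr_bool, "country"] = "Korea"        # assign only where the mask is True
print(toy)
```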

By the way, remember the `f500_tail`? Let's inspect the `previous_rank` column there (because actually there are some problems with values):

In [26]:
f500_tail

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Telecom Italia,493,21941,-17.4,1999.4,74295,,Flavio Cattaneo,Telecommunications,Telecommunications,404,Italy,"Milan, Italy",http://www.telecomitalia.com,18,61227,22366
Xiamen ITG Holding Group,494,21930,34.3,35.6,12161,-25.1,Xu Xiaoxi,Trading,Wholesalers,0,China,"Xiamen, China",http://www.itgholding.com.cn,1,18454,1066
Xinjiang Guanghui Industry Investment,495,21919,31.1,251.8,31957,49.9,Shang Jiqiang,Trading,Wholesalers,0,China,"Urumqi, China",http://www.guanghui.com,1,65616,4563
Teva Pharmaceutical Industries,496,21903,11.5,329.0,92890,-79.3,Yitzhak Peterburg,Pharmaceuticals,Health Care,0,Israel,"Petach Tikva, Israel",http://www.tevapharm.com,1,56960,33337
New China Life Insurance,497,21796,-13.3,743.9,100609,-45.6,Wan Feng,"Insurance: Life, Health (stock)",Financials,427,China,"Beijing, China",http://www.newchinalife.com,2,54378,8507
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310


We can see that some of the companies have a 0 value as their previous rank, which probably means that they were not included in the previous Fortune 500 list. These zeros would distort any statistics we calculate from the column, so it is wiser to replace them with a null value (`np.nan`).

In [27]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan # don't forget to select the column or this will change values of every column!
prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

In [28]:
prev_rank_after # integers are changed to floats because of using NumPy NaN

NaN       33
 471.0     1
 234.0     1
 125.0     1
 166.0     1
Name: previous_rank, dtype: int64

In [29]:
prev_rank_before

0      33
159     1
147     1
148     1
149     1
Name: previous_rank, dtype: int64
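To see why the zeros matter, here is a small sketch on a made-up series. It uses `Series.mask()` rather than the `.loc[]` assignment from the cell above. That substitution is an assumption on our side: recent pandas versions may warn about (or reject) writing `NaN` into an integer column in place, while `mask()` returns a new float series.

```python
import pandas as pd

prev = pd.Series([1, 0, 4, 0, 7], name="previous_rank")  # made-up ranks
print(prev.dtype, prev.mean())          # the zeros drag the mean down

cleaned = prev.mask(prev == 0)          # values where the condition holds become NaN
print(cleaned.dtype, cleaned.mean())    # NaN values are skipped by mean()
print(cleaned.isnull().sum())           # number of values that were replaced
```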

Now, let's summarize everything we learnt by calculating a specific statistic or attribute for each of the three most common countries in our f500 dataframe.

First, let's see the three most common countries from the dataframe:

In [30]:
top_3_countries = f500["country"].value_counts().head(3)
top_3_countries

USA      132
China    109
Japan     51
Name: country, dtype: int64

In [31]:
top_3_countries = f500["country"].value_counts().head(3)
cities_usa = f500.loc[f500["country"] == "USA", "hq_location"].value_counts().head()
# creates a series containing counts of the five most common
# headquarters locations for companies headquartered in the USA
sector_china = f500.loc[f500["country"] == "China", "sector"].value_counts().head(3)
# creates a series containing counts of the three most common sectors
# for companies headquartered in China
mean_employees_japan = f500.loc[f500["country"] == "Japan", "employees"].mean()
# creates a float containing the mean number of employees
# for companies headquartered in Japan

That's it for now! Check the [FAQ](http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html) and the documentation if you have questions.

---

## Exploring data with Pandas

In pandas, each axis has labels. In some scenarios, like selecting specific columns, labels make selections easier; in others, they make things harder. If you wanted to select the tenth to twentieth rows in a dataframe, you'd need to know their labels first. So how do we select by position in a dataframe?

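Here's how! Just use `Series.iloc[]` or `DataFrame.iloc[]` instead of the `.loc[]` methods! The "i" in "iloc" stands for integer: selection is by integer position rather than by label. The syntax is the same, with one difference to keep in mind: an `.iloc[]` slice excludes its end position, while a `.loc[]` slice includes its end label.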

```python
df.iloc[row, column]
```

Let's have some practice with these methods:

In [32]:
import pandas as pd
import numpy as np

f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
fifth_row = f500.iloc[4]
first_three_rows = f500.iloc[:3]
first_seventh_row_slice = f500.iloc[[0, 6], :5]

f500 = pd.read_csv("f500.csv") # more common way to read a dataset
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

Sometimes the `.iloc[]` method and the `.loc[]` method select the same rows: if you didn't assign an index column, the dataframe's labels are integers that match the row positions. But if you change the initial order of the data, the two methods will give different results. For example, you could sort the data in a different way, like this (we'll use the [`pandas.DataFrame.sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method):

In [33]:
sorted_emp = f500.sort_values(by = "employees", ascending = False)
top5_emp = sorted_emp.iloc[:5] # .loc[:5] would slice by label (up to the row labelled 5), not return the first five rows
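How the two indexers diverge after sorting can be seen on a three-row stand-in dataframe (made-up values). In this toy case `.loc[:2]` does not fail; it slices by label, which gives a different selection from the positional `.iloc[:2]`.

```python
import pandas as pd

toy = pd.DataFrame({"company": ["A", "B", "C"], "employees": [10, 20, 30]})

# With the default integer index, .loc slices include the end label, .iloc slices do not.
print(len(toy.loc[:1]), len(toy.iloc[:1]))   # 2 1

sorted_emp = toy.sort_values(by="employees", ascending=False)
print(list(sorted_emp.index))                # [2, 1, 0]

top2_by_position = sorted_emp.iloc[:2]       # first two rows of the sorted frame
up_to_label_2 = sorted_emp.loc[:2]           # everything up to the row labelled 2
print(list(top2_by_position["company"]))     # ['C', 'B']
print(list(up_to_label_2["company"]))        # ['C']
```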

Here are several methods that will help during the analysis of data.

* [`Series.str.contains()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html)
* [`Series.str.endswith()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.endswith.html)
* [`Series.isnull()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isnull.html)
* [`Series.notnull()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.notnull.html)

The main advantage over the usual Python string methods is that the methods above are vectorized: each call applies to every value in the series at once, so we don't need to write an explicit loop.

In [34]:
previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["rank"] - previously_ranked["previous_rank"]
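A short sketch of these methods on a made-up dataframe. The company names and `example` websites are placeholders, not rows from `f500.csv`. Each call returns a boolean series that can be used for selection.

```python
import pandas as pd

toy = pd.DataFrame({
    "company": ["Alpha Motor", "Beta Motors", "Gamma"],   # hypothetical names
    "website": [
        "http://alpha.example.com",
        "http://beta.example.org",
        "http://gamma.example.com",
    ],
    "previous_rank": [8.0, None, 113.0],
})

has_motor = toy["company"].str.contains("Motor")   # substring match per value
dot_com = toy["website"].str.endswith(".com")      # suffix match per value
ranked = toy["previous_rank"].notnull()            # True where a value is present

print(toy[has_motor & ranked]["company"].tolist())
```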

Boolean indexing is a powerful tool which allows us to select or exclude parts of our data based on their values. Sometimes we'll need to combine conditions with boolean operators like `and`, `or`, `not`. The pandas syntax looks a little bit different, and each comparison has to be wrapped in parentheses. In pandas:

* `and` is `&`
* `or` is `|`
* `not` is `~`

Below are some practice examples:

In [35]:
cols = ["company",
       "revenues",
       "country"]
f500_sel = f500[cols].head() # created a small piece of the dataframe for practice examples
final_cols = ["revenues", "country"]
# longer
over_265 = f500_sel["revenues"]
over_265_bool = over_265 > 265000
china = f500_sel["country"] == "China"
combined = over_265_bool & china
result = f500_sel.loc[combined,final_cols]
# shorter
result = f500_sel.loc[(f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China"), final_cols]

In [36]:
result

Unnamed: 0,revenues,country
1,315199,China
2,267518,China
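All three operators can be tried on a four-row dataframe. The revenues and countries below are copied from the first rows shown earlier; the default integer index stands in for the company names.

```python
import pandas as pd

toy = pd.DataFrame({
    "revenues": [485873, 315199, 267518, 254694],
    "country": ["USA", "China", "China", "Japan"],
})

big = toy["revenues"] > 265000
china = toy["country"] == "China"

print(toy[big & china].index.tolist())    # both conditions  -> [1, 2]
print(toy[big | china].index.tolist())    # either condition -> [0, 1, 2]
print(toy[~china].index.tolist())         # negation         -> [0, 3]

# Comparisons written inline need parentheses, because & binds tighter than > and ==.
inline = toy[(toy["revenues"] > 265000) & ~(toy["country"] == "China")]
print(inline.index.tolist())              # [0]
```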


In [37]:
# companies with revenues over 100 billion and negative profits
big_rev_neg_profit = f500[(f500["revenues"] > 100000) & (f500["profits"] < 0)]
# the first 5 companies in the Technology sector that are not headquartered in the USA
tech_outside_usa = f500[(f500["country"] != "USA") & (f500["sector"] == "Technology")].head()

In [38]:
big_rev_neg_profit

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
32,Japan Post Holdings,33,122990,3.6,-267.4,2631385,-107.5,Masatsugu Nagato,"Insurance: Life, Health (stock)",Financials,37.0,Japan,"Tokyo, Japan",http://www.japanpost.jp,21,248384,91532
44,Chevron,45,107567,-18.0,-497.0,260078,-110.8,John S. Watson,Petroleum Refining,Energy,31.0,USA,"San Ramon, CA",http://www.chevron.com,23,55200,145556


In [39]:
tech_outside_usa

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
14,Samsung Electronics,15,173957,-2.0,19316.5,217104,16.8,Oh-Hyun Kwon,"Electronics, Electrical Equip.",Technology,13.0,South Korea,"Suwon, South Korea",http://www.samsung.com,23,325000,154376
26,Hon Hai Precision Industry,27,135129,-4.3,4608.8,80436,-0.4,Terry Gou,"Electronics, Electrical Equip.",Technology,25.0,Taiwan,"New Taipei City, Taiwan",http://www.foxconn.com,13,726772,33476
70,Hitachi,71,84558,1.2,2134.3,86742,48.8,Toshiaki Higashihara,"Electronics, Electrical Equip.",Technology,79.0,Japan,"Tokyo, Japan",http://www.hitachi.com,23,303887,26632
82,Huawei Investment & Holding,83,78511,24.9,5579.4,63837,-5.0,Ren Zhengfei,Network and Other Communications Equipment,Technology,129.0,China,"Shenzhen, China",http://www.huawei.com,8,180000,20159
104,Sony,105,70170,3.9,676.4,158519,-45.1,Kazuo Hirai,"Electronics, Electrical Equip.",Technology,113.0,Japan,"Tokyo, Japan",http://www.sony.net,23,128400,22415


You can add a Series to a DataFrame as a new column. Pandas aligns the values on the index labels, so the order of the series doesn't have to match the order of the dataframe. If some indices are missing from the series, the new column will have a `NaN` value at those indices.

In [40]:
f500["rank_change"] = rank_change
f500["rank_change"].tail()

495     NaN
496    70.0
497    61.0
498    32.0
499     NaN
Name: rank_change, dtype: float64
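The alignment behaviour is easiest to see on a tiny made-up example, where the series is in a different order from the dataframe and has no entry for one of the labels:

```python
import pandas as pd

toy = pd.DataFrame({"rank": [1, 2, 3]}, index=["a", "b", "c"])
change = pd.Series([5.0, -2.0], index=["c", "a"])   # different order, no entry for "b"

toy["rank_change"] = change   # values are matched by label, not by position
print(toy)
```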

In pandas we usually avoid loops, because the vectorized methods are faster and more concise. But sometimes we can't avoid them. One scenario where it is useful to use loops with pandas is when we are performing **aggregation**. Aggregation is where we apply a statistical operation to *groups* of our data.

Here's an example using [`Series.unique()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) and [`DataFrame.sort_values()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) methods (we calculate the company that employs the most people in each country):

In [41]:
top_employer_by_country = {}
countries = f500["country"].unique()
for country in countries:
    top_employer_by_country[country] = f500[f500["country"] == country].sort_values(
        by = ["employees"], ascending = False).iloc[0, 0]

In [42]:
top_employer_by_country

{'USA': 'Walmart',
 'China': 'China National Petroleum',
 'Japan': 'Toyota Motor',
 'Germany': 'Volkswagen',
 'Netherlands': 'EXOR Group',
 'Britain': 'Compass Group',
 'South Korea': 'Samsung Electronics',
 'Switzerland': 'Nestle',
 'France': 'Sodexo',
 'Taiwan': 'Hon Hai Precision Industry',
 'Singapore': 'Flex',
 'Italy': 'Poste Italiane',
 'Russia': 'Gazprom',
 'Spain': 'Banco Santander',
 'Brazil': 'JBS',
 'Mexico': 'America Movil',
 'Luxembourg': 'ArcelorMittal',
 'India': 'State Bank of India',
 'Malaysia': 'Petronas',
 'Thailand': 'PTT',
 'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Norway': 'Statoil',
 'Canada': 'George Weston',
 'Ireland': 'Accenture',
 'Indonesia': 'Pertamina',
 'Denmark': 'Maersk Group',
 'Saudi Arabia': 'SABIC',
 'Sweden': 'H & M Hennes & Mauritz',
 'Finland': 'Nokia',
 'Venezuela': 'Mercantil Servicios Financieros',
 'Turkey': 'Koc Holding',
 'U.A.E': 'Emirates Group',
 'Israel': 'Teva Pharmaceutical Industries'}
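The same loop can be sketched on a small made-up dataframe. One variation here is an assumption of ours rather than part of the cell above: the company is picked by its column label instead of `.iloc[0, 0]`, so the result does not depend on `company` being the first column.

```python
import pandas as pd

toy = pd.DataFrame({
    "company": ["A", "B", "C", "D"],              # made-up companies
    "country": ["USA", "USA", "Japan", "Japan"],
    "employees": [100, 300, 250, 50],
})

top_employer = {}
for country in toy["country"].unique():
    in_country = toy[toy["country"] == country]   # rows for this group only
    biggest = in_country.sort_values(by="employees", ascending=False).iloc[0]
    top_employer[country] = biggest["company"]    # select by label, not by column position

print(top_employer)   # {'USA': 'B', 'Japan': 'C'}
```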

Now we're going to add a new column to our dataframe, and then perform some aggregation using that new column.

The column we create is going to contain a metric called [return on assets](https://www.inc.com/encyclopedia/return-on-assets-roa.html) (ROA). ROA is a business-specific metric which indicates a company's ability to make a profit using its available assets.

* *ROA = PROFIT / ASSETS*

Once we've created the new column, we'll aggregate by sector, and find the company with the highest ROA from each sector:

In [43]:
f500["roa"] = f500["profits"] / f500["assets"]
top_roa_by_sector = {}
sectors = f500["sector"].unique()
for sector in sectors:
    top_roa_by_sector[sector] = f500[f500["sector"] == sector].sort_values(
        by = ["roa"], ascending = False).iloc[0, 0]

In [44]:
top_roa_by_sector

{'Retailing': 'H & M Hennes & Mauritz',
 'Energy': 'National Grid',
 'Motor Vehicles & Parts': 'Subaru',
 'Financials': 'Berkshire Hathaway',
 'Technology': 'Accenture',
 'Wholesalers': 'McKesson',
 'Health Care': 'Gilead Sciences',
 'Telecommunications': 'KDDI',
 'Engineering & Construction': 'Pacific Construction Group',
 'Industrials': '3M',
 'Food & Drug Stores': 'Publix Super Markets',
 'Aerospace & Defense': 'Lockheed Martin',
 'Food, Beverages & Tobacco': 'Philip Morris International',
 'Household Products': 'Unilever',
 'Transportation': 'Delta Air Lines',
 'Materials': 'CRH',
 'Chemicals': 'LyondellBasell Industries',
 'Media': 'Disney',
 'Apparel': 'Nike',
 'Hotels, Restaurants & Leisure': 'McDonald’s',
 'Business Services': 'Adecco Group'}

Wow, awesome!

---

### Basic data cleaning in Pandas
---

In reality, data is rarely in the format you need it to be to perform analysis. Data scientists commonly spend over half their time cleaning data, so knowing how to clean 'messy' data is an extremely important skill.

We'll be working with `laptops.csv`, a CSV file containing information on about 1,300 laptop computers.

Now, text files have different types of encodings. But the best thing to do if your file has an unknown encoding is to try the most common encodings. The most common encodings are, in order:

* UTF-8 (the default for Python)
* Latin-1 (also known as ISO-8859-1)
* Windows-1251

To specify an encoding when reading a CSV file with pandas, use the `encoding` argument of the `pandas.read_csv()` function, passing the encoding as a string:

```python
df = pd.read_csv("filename.csv", encoding="UTF-8")
```

Because UTF-8 is the default, you don't need to specify the encoding when the file you're reading is encoded with UTF-8. If you read a file that isn't, you'll notice the error message mentions UTF-8.

Hint: the dataset's encoding is not UTF-8!
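The "try the most common encodings in order" advice can be sketched as a small helper. This is only an illustration: `read_csv_any_encoding` is a hypothetical function name, and the demo file is a throwaway CSV written in Latin-1 so the example is self-contained.

```python
import os
import tempfile

import pandas as pd


def read_csv_any_encoding(path, encodings=("UTF-8", "Latin-1", "Windows-1251")):
    """Try each encoding in order and return (dataframe, encoding_used)."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError:
            continue  # this encoding can't decode the file, try the next one
    raise ValueError("none of the candidate encodings could decode the file")


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    # "Caf\u00e9" contains a byte sequence that is not valid UTF-8 when written as Latin-1
    with open(demo_path, "w", encoding="Latin-1") as f:
        f.write("name,price\nCaf\u00e9,3\n")
    demo_df, used_encoding = read_csv_any_encoding(demo_path)

print(used_encoding)
```

Note that Latin-1 can decode any sequence of bytes, so a successful read with it doesn't prove the file really is Latin-1. It's still worth eyeballing the text for garbled characters.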

In [45]:
import pandas as pd
laptops = pd.read_csv("laptops.csv", encoding = "Latin-1")
laptops.info() # df.info() method shows information (sic!) about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 13 columns):
Manufacturer                1303 non-null object
Model Name                  1303 non-null object
Category                    1303 non-null object
Screen Size                 1303 non-null object
Screen                      1303 non-null object
CPU                         1303 non-null object
RAM                         1303 non-null object
 Storage                    1303 non-null object
GPU                         1303 non-null object
Operating System            1303 non-null object
Operating System Version    1133 non-null object
Weight                      1303 non-null object
Price (Euros)               1303 non-null object
dtypes: object(13)
memory usage: 132.4+ KB


One of the columns, `Operating System Version`, has some null values. The column labels mix upper and lowercase letters, spaces, and parentheses, and one label (` Storage`) starts with a space. Let's clean the column labels using some of Python's built-in [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) (see also the [pandas reference](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html?highlight=string#string-methods)):

In [46]:
laptops.columns # displays the labels of the columns

Index(['Manufacturer', 'Model Name', 'Category', 'Screen Size', 'Screen',
       'CPU', 'RAM', ' Storage', 'GPU', 'Operating System',
       'Operating System Version', 'Weight', 'Price (Euros)'],
      dtype='object')

In [47]:
def clean_col(col):
    col = col.strip()
    col = col.replace("(","")
    col = col.replace(")","")
    col = col.lower()
    return col

laptops.columns = [clean_col(c) for c in laptops.columns]
laptops.columns

Index(['manufacturer', 'model name', 'category', 'screen size', 'screen',
       'cpu', 'ram', 'storage', 'gpu', 'operating system',
       'operating system version', 'weight', 'price euros'],
      dtype='object')

In [48]:
def clean_string(dirty_string):
    return dirty_string.strip().replace(
        "operating system", "os").replace(" ", "_").replace("(", "").replace(")", "").lower()

laptops.columns = [clean_string(column) for column in laptops.columns] # assigns new columns
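As an alternative to a list comprehension, the column index exposes the same vectorized `.str` accessor that series have. A minimal sketch on an empty dataframe with a few of the labels shown above (the `demo` name is just for illustration):

```python
import pandas as pd

demo = pd.DataFrame(columns=["Manufacturer", " Storage",
                             "Operating System Version", "Price (Euros)"])

demo.columns = (demo.columns
                .str.strip()                                         # drop leading/trailing spaces
                .str.lower()
                .str.replace("operating system", "os", regex=False)
                .str.replace("(", "", regex=False)                   # regex=False treats "(" literally
                .str.replace(")", "", regex=False)
                .str.replace(" ", "_", regex=False)
               )
cleaned = list(demo.columns)
print(cleaned)
```

Passing `regex=False` makes it explicit that the parentheses are plain characters rather than regular expression syntax.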

In [49]:
laptops.head()

Unnamed: 0,manufacturer,model_name,category,screen_size,screen,cpu,ram,storage,gpu,os,os_version,weight,price_euros
0,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,,1.37kg,"1339,69"
1,Apple,Macbook Air,Ultrabook,"13.3""",1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,,1.34kg,"898,94"
2,HP,250 G6,Notebook,"15.6""",Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,,1.86kg,"575,00"
3,Apple,MacBook Pro,Ultrabook,"15.4""",IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,,1.83kg,"2537,45"
4,Apple,MacBook Pro,Ultrabook,"13.3""",IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,,1.37kg,"1803,60"


As we've seen, every column in the table has the object dtype (that is, it holds strings). Let's convert some columns to numbers (for example, `screen_size`). We'll use these methods:

* `Series.str.replace()`
* [`Series.astype()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) (use the type as a parameter)
* [`DataFrame.rename()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html), which renames specific axis labels using a dictionary whose keys are the old labels and whose values are the new labels. Pass `axis=1` so pandas knows we want to rename column labels. We'll use `inplace=True` here, although assigning the result back to the variable would give an identical result.
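The three methods can be combined as in the following sketch, which runs on a two-row toy dataframe (hypothetical values in the same style as the dataset). It uses the assignment form of `rename()` with the `columns=` keyword, which is equivalent to passing `axis=1`:

```python
import pandas as pd

demo = pd.DataFrame({"screen_size": ['13.3"', '15.6"'],
                     "ram": ["8GB", "16GB"]})

# Strip the unit characters, then convert the remaining text to numbers
demo["screen_size"] = demo["screen_size"].str.replace('"', "", regex=False).astype(float)
demo["ram"] = demo["ram"].str.replace("GB", "", regex=False).astype(int)

# Record the unit in the label instead; assigning back avoids inplace=True
demo = demo.rename(columns={"screen_size": "screen_size_inches", "ram": "ram_gb"})
print(demo.dtypes)
```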

In [50]:
laptops["screen_size"].unique() # let's look for the patterns

array(['13.3"', '15.6"', '15.4"', '14.0"', '12.0"', '11.6"', '17.3"',
       '10.1"', '13.5"', '12.5"', '13.0"', '18.4"', '13.9"', '12.3"',
       '17.0"', '15.0"', '14.1"', '11.3"'], dtype=object)

In [51]:
laptops["screen_size"] = laptops["screen_size"].str.replace('"','').astype(float)
laptops.rename({"screen_size": "screen_size_inches"}, axis = 1, inplace = True)

In [52]:
laptops["ram"].unique() # doing same with the ram column

array(['8GB', '16GB', '4GB', '2GB', '12GB', '6GB', '32GB', '24GB', '64GB'],
      dtype=object)

In [53]:
laptops["ram"] = laptops["ram"].str.replace("GB", "").astype(int)
laptops.rename({"ram": "ram_gb"}, axis = 1, inplace = True)

laptops.dtypes # inspecting the new types

manufacturer           object
model_name             object
category               object
screen_size_inches    float64
screen                 object
cpu                    object
ram_gb                  int64
storage                object
gpu                    object
os                     object
os_version             object
weight                 object
price_euros            object
dtype: object

Let's do the same to the `weight` and `price_euros` columns:

In [54]:
laptops["weight"].unique()

array(['1.37kg', '1.34kg', '1.86kg', '1.83kg', '2.1kg', '2.04kg', '1.3kg',
       '1.6kg', '2.2kg', '0.92kg', '1.22kg', '0.98kg', '2.5kg', '1.62kg',
       '1.91kg', '2.3kg', '1.35kg', '1.88kg', '1.89kg', '1.65kg',
       '2.71kg', '1.2kg', '1.44kg', '2.8kg', '2kg', '2.65kg', '2.77kg',
       '3.2kg', '0.69kg', '1.49kg', '2.4kg', '2.13kg', '2.43kg', '1.7kg',
       '1.4kg', '1.8kg', '1.9kg', '3kg', '1.252kg', '2.7kg', '2.02kg',
       '1.63kg', '1.96kg', '1.21kg', '2.45kg', '1.25kg', '1.5kg',
       '2.62kg', '1.38kg', '1.58kg', '1.85kg', '1.23kg', '1.26kg',
       '2.16kg', '2.36kg', '2.05kg', '1.32kg', '1.75kg', '0.97kg',
       '2.9kg', '2.56kg', '1.48kg', '1.74kg', '1.1kg', '1.56kg', '2.03kg',
       '1.05kg', '4.4kg', '1.90kg', '1.29kg', '2.0kg', '1.95kg', '2.06kg',
       '1.12kg', '1.42kg', '3.49kg', '3.35kg', '2.23kg', '4.42kg',
       '2.69kg', '2.37kg', '4.7kg', '3.6kg', '2.08kg', '4.3kg', '1.68kg',
       '1.41kg', '4.14kg', '2.18kg', '2.24kg', '2.67kg', '2.14kg',
       '1.

While it appears that the weight column may just need the kg characters removed from the end of each string, there are a lot of unique values for the weight column, so it's hard to visually confirm if there are any exceptions to the pattern.

If we can't see any exceptions, it's OK to move on to the next step: if we missed any, the error we get will tell us the offending value so we can fix it.

```python
laptops["weight"] = (laptops["weight"]
                        .str.replace("kg","")
                        .astype(float)
                    )
```

This will raise an error:

```python
ValueError: could not convert string to float: '4s'
```

Keep in mind that this is the value after `kg` has been removed: because of our method chaining, `.astype()` only sees the output of `.str.replace()`, so the string `'4s'` may not actually exist in the raw data. To look at the raw value, we can use the pandas [`Series.str.contains()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) method, which returns a boolean series indicating whether a substring is found:

In [55]:
laptops.loc[laptops["weight"].str.contains('s'), "weight"]

1061    4kgs
Name: weight, dtype: object
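Searching for `'s'` works here because the error message already told us what to look for. A more general way to list every raw value that fails to convert is `pandas.to_numeric()` with `errors="coerce"`, which turns unparseable strings into `NaN` instead of raising. A minimal sketch on a hypothetical four-value series:

```python
import pandas as pd

weights = pd.Series(["1.37kg", "2kg", "4kgs", "0.92kg"])

stripped = weights.str.replace("kg", "", regex=False)
converted = pd.to_numeric(stripped, errors="coerce")  # unparseable values become NaN

# Use the NaN positions to look up the original, untouched strings
bad_raw = weights[converted.isnull()]
print(bad_raw)
```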

In [56]:
laptops["weight"] = laptops["weight"].str.replace("kgs", "").str.replace("kg", "").astype(float)
laptops.rename({"weight": "weight_kg"}, axis = 1, inplace = True)

In [57]:
laptops["price_euros"].unique()

array(['1339,69', '898,94', '575,00', '2537,45', '1803,60', '400,00',
       '2139,97', '1158,70', '1495,00', '770,00', '393,90', '344,99',
       '2439,97', '498,90', '1262,40', '1518,55', '745,00', '2858,00',
       '499,00', '979,00', '191,90', '999,00', '258,00', '819,00',
       '659,00', '418,64', '1099,00', '800,00', '1298,00', '896,00',
       '244,99', '199,00', '439,00', '1869,00', '998,00', '249,00',
       '367,00', '488,69', '879,00', '389,00', '1499,00', '522,99',
       '682,00', '1419,00', '369,00', '1299,00', '639,00', '466,00',
       '319,00', '841,00', '398,49', '1103,00', '384,00', '767,80',
       '586,19', '2449,00', '415,00', '599,00', '941,00', '690,00',
       '1983,00', '438,69', '229,00', '549,00', '949,00', '1089,00',
       '955,00', '870,00', '1095,00', '519,00', '855,00', '530,00',
       '977,00', '1096,16', '1510,00', '860,00', '399,00', '395,00',
       '1349,00', '699,00', '598,99', '1449,00', '1649,00', '689,00',
       '1197,00', '1195,00', '1049,0

In [58]:
laptops["price_euros"] = laptops["price_euros"].str.replace(",", ".").astype(float)
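When the comma decimal separator is known before loading, `pandas.read_csv()` can also handle it at read time through its `decimal` parameter. A minimal sketch on a made-up, semicolon-separated snippet (the data is hypothetical and the layout differs from `laptops.csv`):

```python
import io

import pandas as pd

# Hypothetical European-style CSV: ";" separates fields, "," marks decimals
raw = "model;price_euros\nA;1339,69\nB;575,00\n"

demo = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
price_dtype = demo["price_euros"].dtype
print(price_dtype)
```

With this option the column arrives as a float, so no `str.replace()` step is needed afterwards.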

In [59]:
laptops["weight_kg"].describe()

count    1303.000000
mean        2.038734
std         0.665475
min         0.690000
25%         1.500000
50%         2.040000
75%         2.300000
max         4.700000
Name: weight_kg, dtype: float64

In [60]:
laptops["price_euros"].describe()

count    1303.000000
mean     1123.686992
std       699.009043
min       174.000000
25%       599.000000
50%       977.000000
75%      1487.880000
max      6099.000000
Name: price_euros, dtype: float64

Here's an example of using the [`Series.str.split()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) method (with explanations in comments):

In [61]:
laptops["gpu_manufacturer"] = (laptops["gpu"] # selects the column
                            .str.split(n=1,expand=True) # n caps the number of splits; expand=True returns a dataframe instead of a series of lists
                            .iloc[:,0] # .iloc selects only the first column
                            )

laptops["cpu_manufacturer"] = (laptops["cpu"]
                            .str.split(n=1,expand=True)
                            .iloc[:,0]
                            )

In [62]:
laptops.head().iloc[:, -2:]

Unnamed: 0,gpu_manufacturer,cpu_manufacturer
0,Intel,Intel
1,Intel,Intel
2,Intel,Intel
3,AMD,Intel
4,Intel,Intel
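Without `expand=True`, `Series.str.split()` returns a series of lists, and the `.str` accessor can index into those lists directly. The sketch below compares both routes on a small hypothetical series of GPU names:

```python
import pandas as pd

gpu = pd.Series(["Intel Iris Plus Graphics 640",
                 "AMD Radeon Pro 455",
                 "Nvidia GeForce MX150"])

# Route 1: expand into a dataframe, then take the first column
first_via_expand = gpu.str.split(n=1, expand=True).iloc[:, 0]

# Route 2: keep the lists and take element 0 of each one
first_via_str = gpu.str.split(n=1).str[0]
print(first_via_str.tolist())
```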


Let's work with the `screen` column as well:

In [63]:
laptops["screen"].unique()

array(['IPS Panel Retina Display 2560x1600', '1440x900',
       'Full HD 1920x1080', 'IPS Panel Retina Display 2880x1800',
       '1366x768', 'IPS Panel Full HD 1920x1080',
       'IPS Panel Retina Display 2304x1440',
       'IPS Panel Full HD / Touchscreen 1920x1080',
       'Full HD / Touchscreen 1920x1080',
       'Touchscreen / Quad HD+ 3200x1800',
       'IPS Panel Touchscreen 1920x1200', 'Touchscreen 2256x1504',
       'Quad HD+ / Touchscreen 3200x1800', 'IPS Panel 1366x768',
       'IPS Panel 4K Ultra HD / Touchscreen 3840x2160',
       'IPS Panel Full HD 2160x1440',
       '4K Ultra HD / Touchscreen 3840x2160', 'Touchscreen 2560x1440',
       '1600x900', 'IPS Panel 4K Ultra HD 3840x2160',
       '4K Ultra HD 3840x2160', 'Touchscreen 1366x768',
       'IPS Panel Full HD 1366x768', 'IPS Panel 2560x1440',
       'IPS Panel Full HD 2560x1440',
       'IPS Panel Retina Display 2736x1824', 'Touchscreen 2400x1600',
       '2560x1440', 'IPS Panel Quad HD+ 2560x1440',
       'IPS Panel 

Some of the values contain only a resolution, while others include a description as well. But alas, the resolution comes last, so we can't use the `Series.str.split()` method with the `n = 1` parameter! Luckily, we have [`Series.str.rsplit()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.rsplit.html), which works from the end of the string!

In [64]:
laptops["screen_resolution"] = laptops["screen"].str.rsplit(expand = True, n = 1).iloc[:, -1]

In [65]:
laptops["screen_resolution"].unique()

array(['2560x1600', None, '1920x1080', '2880x1800', '2304x1440',
       '3200x1800', '1920x1200', '2256x1504', '1366x768', '3840x2160',
       '2160x1440', '2560x1440', '2736x1824', '2400x1600'], dtype=object)

Whoops, we have some problems (`None` values)! Let's see what we did:

In [66]:
laptops["screen"].str.rsplit(expand = True, n = 1).head()

Unnamed: 0,0,1
0,IPS Panel Retina Display,2560x1600
1,1440x900,
2,Full HD,1920x1080
3,IPS Panel Retina Display,2880x1800
4,IPS Panel Retina Display,2560x1600


We can replace the `None` values with the value from the `0` column using boolean indexing and the `Series.isnull()` method:

*NB! If you think about using a for loop - think about boolean indexing!*

In [67]:
screen_columns = laptops["screen"].str.rsplit(expand = True, n = 1)
screen_columns.loc[screen_columns.loc[:, 1].isnull(), 1] = screen_columns.loc[:, 0]
screen_columns[1].unique()

array(['2560x1600', '1440x900', '1920x1080', '2880x1800', '1366x768',
       '2304x1440', '3200x1800', '1920x1200', '2256x1504', '3840x2160',
       '2160x1440', '2560x1440', '1600x900', '2736x1824', '2400x1600'],
      dtype=object)

In [68]:
laptops["screen_resolution"] = screen_columns[1]
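Two alternatives sidestep the `None` problem entirely, sketched here on three hypothetical values taken from the patterns above. Indexing the split lists with `-1` always picks the last piece, even when there is only one, and `Series.str.extract()` pulls out the `WIDTHxHEIGHT` pattern with a regular expression:

```python
import pandas as pd

screen = pd.Series(["IPS Panel Retina Display 2560x1600",
                    "1440x900",
                    "Full HD 1920x1080"])

# Last element of each split list; works for one-piece values too
via_rsplit = screen.str.rsplit(n=1).str[-1]

# Digits, an "x", digits, anchored at the end of the string
via_extract = screen.str.extract(r"(\d+x\d+)$", expand=False)
print(via_extract.tolist())
```

The regular expression assumes every value ends with a resolution; a value that doesn't would come back as `NaN`.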

In [69]:
laptops["cpu"].unique()[:10]

array(['Intel Core i5 2.3GHz', 'Intel Core i5 1.8GHz',
       'Intel Core i5 7200U 2.5GHz', 'Intel Core i7 2.7GHz',
       'Intel Core i5 3.1GHz', 'AMD A9-Series 9420 3GHz',
       'Intel Core i7 2.2GHz', 'Intel Core i7 8550U 1.8GHz',
       'Intel Core i5 8250U 1.6GHz', 'Intel Core i3 6006U 2GHz'],
      dtype=object)

Let's do the same to the `cpu` column. We'll also need to remove `'GHz'`:

In [70]:
laptops["cpu_speed_ghz"] = (laptops["cpu"]
                            .str.replace("GHz", "")
                            .str.rsplit(n = 1, expand = True)
                            .iloc[:, 1]
                            .astype(float)
                           )

In [71]:
laptops["cpu_speed_ghz"].unique()

array([2.3 , 1.8 , 2.5 , 2.7 , 3.1 , 3.  , 2.2 , 1.6 , 2.  , 2.8 , 1.2 ,
       2.9 , 2.4 , 1.44, 1.5 , 1.9 , 1.1 , 1.3 , 2.6 , 3.6 , 3.2 , 1.  ,
       2.1 , 0.9 , 1.92])

If your data has been scraped from a webpage, or if there was manual data entry involved at some point, you may end up with inconsistent values. Let's look at an example from our `os` column:

In [72]:
laptops["os"].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          13
Mac OS          8
Android         2
Name: os, dtype: int64

We can see that there are two variations on how the Apple operating system macOS exists in our dataset: `Mac OS` and `macOS`. One way we could fix this is by using a boolean comparison and assignment, but instead we'll use a new way: the [`Series.map()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html) method. The `Series.map()` method is ideal when we want to change multiple values in a column. Even though that's not the case here, we'll use it as an opportunity to learn how the method works.

The most common way to use `Series.map()` is with a dictionary. The keys of our dictionary are the original values in our series, and the corresponding values are what they're updated to:

In [73]:
mapping_dict = {
    'Android': 'Android',
    'Chrome OS': 'Chrome OS',
    'Linux': 'Linux',
    'Mac OS': 'macOS',
    'No OS': 'No OS',
    'Windows': 'Windows',
    'macOS': 'macOS'
}

laptops["os"] = laptops["os"].map(mapping_dict)

In [74]:
laptops["os"].value_counts()

Windows      1125
No OS          66
Linux          62
Chrome OS      27
macOS          21
Android         2
Name: os, dtype: int64

One important thing to remember with `Series.map()` is that if a value from your series doesn't exist as a key in your dictionary, it will convert that value to `NaN`. Let's see what happens when we run map with a dictionary whose keys don't match the values in our series:

In [75]:
mapping_dict = {
    'android': 'Android',
    'chrome OS': 'Chrome OS',
    'linux': 'Linux',
    'mac OS': 'macOS',
    'no OS': 'No OS',
    'windows': 'Windows',
    'MacOS': 'macOS'
}

In [76]:
laptops["os"].map(mapping_dict).head()

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: os, dtype: object

Because none of the corrected values in our series exist as keys in our dictionary, all values become `NaN`! It's very common to come across this, especially when working in a Jupyter notebook, where you can easily re-run cells.
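One way to guard against this is to map with a partial dictionary and then fall back to the original value wherever the result is null. `Series.replace()` behaves that way out of the box, since it leaves unmatched values untouched. A minimal sketch on a hypothetical four-value series:

```python
import pandas as pd

os_col = pd.Series(["Windows", "Mac OS", "macOS", "Linux"])
partial = {"Mac OS": "macOS"}  # only the value we actually want to change

mapped_only = os_col.map(partial)                     # unmatched values become NaN
mapped_with_fallback = os_col.map(partial).fillna(os_col)
replaced = os_col.replace(partial)                    # unmatched values are kept as-is
print(replaced.tolist())
```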

In pandas null values will be indicated by either `NaN` or `None`. Generally the first thing that we want to do is identify which values are missing.

There are two approaches we can use: the [`DataFrame.info()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) method and the [`DataFrame.isnull()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html) method. The `DataFrame.info()` method will print information about the dataframe, including the number of non-null values in each column:

In [77]:
laptops.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1303 entries, 0 to 1302
Data columns (total 17 columns):
manufacturer          1303 non-null object
model_name            1303 non-null object
category              1303 non-null object
screen_size_inches    1303 non-null float64
screen                1303 non-null object
cpu                   1303 non-null object
ram_gb                1303 non-null int64
storage               1303 non-null object
gpu                   1303 non-null object
os                    1303 non-null object
os_version            1133 non-null object
weight_kg             1303 non-null float64
price_euros           1303 non-null float64
gpu_manufacturer      1303 non-null object
cpu_manufacturer      1303 non-null object
screen_resolution     1303 non-null object
cpu_speed_ghz         1303 non-null float64
dtypes: float64(4), int64(1), object(12)
memory usage: 173.1+ KB


Looking at the number of non-null values can be harder to understand than looking at the number of null values. In contrast, `DataFrame.isnull()` returns a boolean dataframe with `True` or `False` for every value in the dataframe. We can then chain `DataFrame.sum()` to get counts, because summing a boolean array counts its `True` values:

In [78]:
laptops.isnull().sum()

manufacturer            0
model_name              0
category                0
screen_size_inches      0
screen                  0
cpu                     0
ram_gb                  0
storage                 0
gpu                     0
os                      0
os_version            170
weight_kg               0
price_euros             0
gpu_manufacturer        0
cpu_manufacturer        0
screen_resolution       0
cpu_speed_ghz           0
dtype: int64

Much clearer, huh?
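Since `True` counts as 1 and `False` as 0, the same trick works with `.mean()` to get the share of missing values per column rather than the count. A small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "os": ["Windows", "No OS", "macOS", "Linux"],
    "os_version": ["10", np.nan, np.nan, np.nan],
})

null_counts = demo.isnull().sum()    # number of nulls per column
null_share = demo.isnull().mean()    # fraction of nulls per column
print(null_share)
```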

We have a few options for how we can handle missing values:

* Remove any rows that have missing values
* Remove any columns that have missing values
* Fill the missing values with some other value
* Leave the missing values as is

The first two options, removing rows and/or columns with missing values, are often used when preparing data for machine learning, as many machine learning algorithms can't be trained on data that includes null values. The method we use to remove rows and columns with null values is [`DataFrame.dropna()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html); removing rows and columns this way is commonly known as *dropping*. The method accepts an `axis` parameter, which indicates whether we want to drop along the index or the column axis (it is 0 by default, so it drops rows with null values).

In [79]:
laptops_no_null_cols = laptops.dropna(axis = 1)
laptops_no_null_rows = laptops.dropna()
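To make the effect of `axis` concrete, here is a minimal sketch on a three-row toy dataframe (hypothetical values). It also shows the `subset` parameter of `dropna()`, which limits the null check to the listed columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "model": ["a", "b", "c"],
    "os": ["Windows", "Linux", None],
    "os_version": ["10", None, None],
})

rows_all_present = demo.dropna()               # keeps only rows with no nulls
cols_all_present = demo.dropna(axis=1)         # keeps only columns with no nulls
rows_with_os = demo.dropna(subset=["os"])      # ignores nulls outside the "os" column
print(rows_all_present.shape, cols_all_present.shape, rows_with_os.shape)
```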

While choosing to drop either the rows or columns is the easiest approach to dealing with missing values, it may not always be the best approach. If, for example, one particular manufacturer's laptops have a greater percentage of missing values for the `os_version` column, we might have removed a disproportionate amount of that manufacturer's laptops, which would affect our analysis. Let's explore this column:

In [80]:
laptops["os_version"].value_counts(dropna=False) # dropna by default is True so it doesn't count null values

10      1072
NaN      170
7         45
X          8
10 S       8
Name: os_version, dtype: int64

We can see that the majority of values in the column are `10`, with the missing values the next most common, and then about 5% of values being one of three others.
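To check whether dropping rows would hit one manufacturer harder than others, the null mask can be grouped by manufacturer and averaged. This is a sketch on made-up data (the manufacturers and values below are hypothetical, not counts from the real dataset):

```python
import pandas as pd

demo = pd.DataFrame({
    "manufacturer": ["Apple", "Apple", "HP", "HP", "HP", "Dell"],
    "os_version": [None, None, "10", None, "10", "10"],
})

# Fraction of rows with a missing os_version, per manufacturer
missing_share = demo["os_version"].isnull().groupby(demo["manufacturer"]).mean()
print(missing_share)
```

A manufacturer with a share close to 1 would almost disappear from the data after `dropna()`.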

We can also explore values of the other columns in the rows with null values. In this case, the `os_version` column is closely related to the `os` column, so we'll look at those values:

In [81]:
os_with_null_v = laptops.loc[laptops["os_version"].isnull(),"os"]
os_with_null_v.value_counts()

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64

Immediately we can observe a few things:

* The largest group of missing values (66 of 170) belongs to laptops that don't include any OS. This is an important distinction, because it's not so much that we don't know what the value is, as that there can't be a value.
* 13 of the laptops that come with macOS do not specify the version. Leaning on our knowledge of macOS, we might know that the full name of `macOS` used to be `Mac OS X`, so we might want to fill these values to be more consistent.

Let's try to fix this:

In [82]:
value_counts_before = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

laptops.loc[laptops["os"] == "macOS", "os_version"] = "X"
laptops.loc[laptops["os"] == "No OS", "os_version"] = "Version Unknown"

value_counts_after = laptops.loc[laptops["os_version"].isnull(), "os"].value_counts()

In [83]:
value_counts_before

No OS        66
Linux        62
Chrome OS    27
macOS        13
Android       2
Name: os, dtype: int64

In [84]:
value_counts_after

Linux        62
Chrome OS    27
Android       2
Name: os, dtype: int64

Let's fix the `storage` column. We want to create 4 columns holding the capacity and the type of each disk. If a laptop has 2 disks, the second disk gets its own capacity and type columns; if there is only 1 disk, those last two columns should have null values. We'll also use a new method here, [`DataFrame.drop()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop), to remove the original column:

In [85]:
storage_df = (laptops["storage"]
                .str.replace("TB", "000")
                .str.replace("GB", "")
                .str.split(pat = "+", expand = True)
               )
storage_1 = storage_df.loc[:, 0].str.strip().str.split(n = 1, expand = True)
storage_2 = storage_df.loc[:, 1].str.strip().str.split(n = 1, expand = True)

laptops["storage_1_capacity_gb"] = storage_1[0].astype(float)
laptops["storage_1_type"] = storage_1[1]
laptops["storage_2_capacity_gb"] = storage_2[0].astype(float)
laptops["storage_2_type"] = storage_2[1]

laptops = laptops.drop("storage", axis = 1)

In [86]:
laptops.head().iloc[:, -4:]

Unnamed: 0,storage_1_capacity_gb,storage_1_type,storage_2_capacity_gb,storage_2_type
0,128.0,SSD,,
1,128.0,Flash Storage,,
2,256.0,SSD,,
3,512.0,SSD,,
4,256.0,SSD,,
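The first five rows all have a single disk, so the second pair of columns is empty. To see the two-disk case, the same pipeline can be traced on a hypothetical two-value series (the second value is made up to illustrate the `+` separator):

```python
import pandas as pd

storage = pd.Series(["128GB SSD", "256GB SSD +  1TB HDD"])

parts = (storage
         .str.replace("TB", "000", regex=False)   # express terabytes as gigabytes
         .str.replace("GB", "", regex=False)
         .str.split(pat="+", expand=True)         # one column per disk
        )
disk_1 = parts[0].str.strip().str.split(n=1, expand=True)
disk_2 = parts[1].str.strip().str.split(n=1, expand=True)

capacity_1 = disk_1[0].astype(float)
capacity_2 = disk_2[0].astype(float)   # NaN where there is no second disk
type_2 = disk_2[1]
print(capacity_2.tolist(), type_2.tolist())
```

One caveat of the `"TB"` to `"000"` replacement: it only works for whole numbers of terabytes, since a hypothetical value such as `1.5TB` would turn into `1.5000`.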


Let's reorder the columns and save the cleaned dataset as a CSV using the [`DataFrame.to_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) method (`index = False` means the index labels won't be written to the file):

In [87]:
laptops_dtypes = laptops.dtypes
cols = ['manufacturer', 'model_name', 'category', 'screen_size_inches',
        'screen', 'cpu', 'cpu_manufacturer', 'screen_resolution', 'cpu_speed_ghz', 'ram_gb',
        'storage_1_type', 'storage_1_capacity_gb', 'storage_2_type',
        'storage_2_capacity_gb', 'gpu', 'gpu_manufacturer', 'os',
        'os_version', 'weight_kg', 'price_euros']

laptops = laptops[cols]
laptops.to_csv('laptops_cleaned.csv', index = False)
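A quick sanity check after saving is to read the file back and confirm that the columns and numeric types survived the round trip. A minimal sketch using a temporary directory and a hypothetical two-row dataframe, so nothing is written next to the real data:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"model_name": ["a", "b"],
                     "ram_gb": [8, 16],
                     "weight_kg": [1.37, 2.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv")
    demo.to_csv(path, index=False)      # no index column is written
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(demo.columns)
print(same_columns, reloaded.dtypes.to_dict())
```

Had the file been written with the default `index=True`, the reloaded dataframe would contain an extra `Unnamed: 0` column.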

# **W E L L D O N E**