## Download a CSV from a Jupyter notebook

```python
from IPython.display import HTML
import base64  
import pandas as pd  

def create_download_link(df, title="Download CSV file", filename="data.csv"):
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

create_download_link(df)
```

# Introduction

This first project is an introduction to ``pandas``, the most popular data-management tool in Python.

Pandas is our Swiss Army knife when it comes to data analysis and data science in Python. We use it to:

- **Load and save data**: read from and write to different formats (CSV, XML, HTML, Excel, JSON, even from the Internet)
- **Analyze data**: perform statistical analysis, query the data, find inconsistencies, etc
- **Data cleaning**: finding missing values, duplicate data, invalid or broken values, etc
- **Visualizations**: with support from ``matplotlib``, we can quickly visualize data
- **Data Wrangling/Munging**: a not-so-scientific term for data handling: merging multiple data sources, creating derived representations, grouping data, etc.

In this project you won't learn much about how to use Pandas, but you'll see it in action. So don't worry if you don't feel comfortable "doing" what's shown here; it'll all be explained in the following projects.

Let's get started! Switch to the next page and start your lab!

## Loading the data

Have you started your lab? If you haven't yet, please go ahead and start the lab. Also, execute the first couple of cells:

```python
import pandas as pd
df = pd.read_csv("s&p500.csv", index_col='Date', parse_dates=True)
df.head()
df.tail()
```

We start by importing the ``pandas`` library and, since we use it so much, we give it the short alias ``pd``. We then load the sample dataset for this project: the S&P500 index from 2017 to 2022.

We load the data using the ``read_csv`` method. Throughout these labs, you'll see that pandas can load data from a lot of different formats, and methods are usually ``read_XXX``; for example: ``read_json``, ``read_excel``, ``read_xml``, etc.
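
As a rough illustration of the ``read_XXX`` naming pattern, the sketch below loads the same tiny table once from CSV text and once from JSON text. The strings and their values are made up for the example (they are not rows from the lab's dataset), and ``io.StringIO`` is used only so the snippet runs without any files on disk.

```python
import io

import pandas as pd

# Made-up sample rows, not taken from the lab's CSV file.
csv_text = "Date,Close\n2022-01-03,100.0\n2022-01-04,101.5\n"
json_text = '[{"Date": "2022-01-03", "Close": 100.0}, {"Date": "2022-01-04", "Close": 101.5}]'

# read_csv and read_json accept file paths or file-like objects.
df_csv = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```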

We've now loaded the data contained in the CSV into the variable ``df``: a DataFrame. DataFrames are the key data structure used by Pandas and you'll see A LOT of them in the following projects; so, don't worry too much about it for now.

Then, we take a quick peek at the data with the ``.head()`` and ``.tail()`` methods. Pandas is prepared to handle MILLIONS of rows (or even more), so we don't usually "print" the whole dataset; we just look at a few rows.

The ``.head()`` method shows the first 5 rows, the ``.tail()`` method shows the last 5 rows. You can immediately see that the DataFrame looks pretty much like an Excel table. It contains an index, which is the date of the reading.

In [None]:
import pandas as pd

df = pd.read_csv("SP500 index 2017 2022.csv", index_col='Date', parse_dates=True)

In [None]:
df.head()

In [None]:
df.tail()

## Analyzing data

The analysis phase of course depends on the task and the data at hand. This is just an example of the capabilities of pandas.

We start by using the ``.describe()`` method, which provides quick summary statistics for the whole DataFrame. We get information like the ``mean`` (the average), ``max``, etc.

We can also get specific information for a single column: ``df['Close'].min()`` or ``df['Close'].max()``. Oh, by the way, you've just seen how to perform "single column selection": ``df['Close'].head()``.
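
To make the difference between whole-DataFrame and single-column statistics concrete, here is a minimal sketch on a toy DataFrame. The name ``toy`` and its values are invented for the illustration; only the column names mirror the lab's data.

```python
import pandas as pd

# Made-up prices and volumes, only to have something to summarize.
toy = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 15.0], "Volume": [100, 300, 200, 400]})

summary = toy.describe()  # a DataFrame of summary statistics, one column per numeric column
close = toy["Close"]      # single column selection returns a Series

print(summary.loc["mean", "Close"])  # 12.0
print(close.min(), close.max())      # 10.0 15.0
print(type(close).__name__)          # Series
```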

In [None]:
df.describe()

Single column statistics

In [None]:
df['Close'].min()

In [None]:
df['Close'].max()

Single column selection:

In [None]:
df['Close'].head()

## Visualizations

Pandas makes it simple to visualize data with the ``.plot()`` method. In reality, ``.plot()`` is just a wrapper around ``matplotlib``, the de-facto plotting library for Python.

As you can see, plotting a column is very easy; just: ``df['Close'].plot()``.

You can also create more advanced visualizations by combining multiple columns, or statistical visualizations (box plots, histograms, etc.), as the following cells show.

In [None]:
df['Close'].plot(figsize=(14, 7), title='S&P Closing Price | 2017 - 2022')

A more advanced chart combining ``Close Price`` and ``Volume``:

In [None]:
ax1 = df['Close'].plot(figsize=(14, 7), title='S&P Closing Price | 2017 - 2022')

ax2 = ax1.twinx()
df['Volume'].plot(ax=ax2, color='red', ylim=[df['Volume'].min(), df['Volume'].max() * 5])

ax1.figure.legend(["Close", "Volume"])

A few statistical visualizations.

A histogram:

In [None]:
df['Volume'].plot(kind='hist')

A box plot:

In [None]:
df['Volume'].plot(kind='box', vert=False)

## Data Wrangling

Pandas excels at Data Wrangling/handling/munging. We can perform a ton of operations, like combining datasets, grouping, melting, creating pivot tables, etc. We have an entire Skill Track just dedicated to Data Wrangling, so you can guess how powerful it is.

For now, we'll focus on just a few simple operations. We'll calculate [Bollinger Bands](https://en.wikipedia.org/wiki/Bollinger_Bands) for our S&P500 data.

Bollinger bands are just a simple visualization/analysis technique that creates two bands, one "roof" and one "floor" of some "support" for a given time series. The reasoning is that, if the time series is "below" the "floor", it's a historic low, and if it's "above" the "roof", it's a historic high. In terms of stock prices and other financial instruments, when the price crosses a band, it's said to be too cheap or too expensive.

> **This is definitely NOT investment advice. Bollinger bands have proved to be INACCURATE, so don't use them in real life. This is just for educational purposes.**

A Bollinger band is defined as two standard deviations above/below the Simple Moving Average (SMA). That's a lot of concepts, but we can start by defining the Simple Moving Average using the ``.rolling(WINDOW).mean()`` method (switch to the lab to follow along).

Understanding the SMA is outside of the scope of this project, but it's basically a "smoothing" method. You see how the SMA *follows* the Close Price, but without so much volatility.
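
To see what ``.rolling(WINDOW).mean()`` does, here is a tiny sketch with a window of 3 on invented numbers (the lab uses a window of 60 on the real closing prices). Each value is the average of the current element and the two before it, so the first two positions have no result yet.

```python
import pandas as pd

# Invented values, chosen so the averages are easy to check by hand.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

sma = prices.rolling(3).mean()
print(sma.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```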

Now, to define the bands, we calculate 2 standard deviations above/below the SMA:

```python
df['Lower Band'] = df['Close SMA'] - (2 * df['Close'].rolling(60).std())
df['Upper Band'] = df['Close SMA'] + (2 * df['Close'].rolling(60).std())
```

The final result should look something like:

In [None]:
df['Close SMA'] = df['Close'].rolling(60).mean()

We compare the new SMA with the closing price:

In [None]:
df[['Close', 'Close SMA']].tail(10)

In [None]:
ax = df[['Close', 'Close SMA']].plot(figsize=(14,7), title='Close Price & its SMA')

We'll calculate the Bollinger Bands:

In [None]:
df['Lower Band'] = df['Close SMA'] - (2 * df['Close'].rolling(60).std())
df['Upper Band'] = df['Close SMA'] + (2 * df['Close'].rolling(60).std())

In [None]:
df[['Close', 'Close SMA', 'Lower Band', 'Upper Band']].tail()

In [None]:
df[['Close', 'Lower Band', 'Upper Band']].plot(figsize=(14,7), title='Close Price & its SMA')

Now we'll find the lows that cross the lower band:

In [None]:
ax = df[['Close', 'Lower Band', 'Upper Band']].plot(figsize=(14, 7), title='Close Price & its SMA')
ax.annotate(
    "Let's find this point", xy=(pd.Timestamp("2020-03-23"), 2237), 
    xytext=(0.9, 0.1), textcoords='axes fraction',
    arrowprops=dict(facecolor='red', shrink=0.05),
    horizontalalignment='right', verticalalignment='bottom');

We can query all the dates that crossed the lower band in the period ``2020-03-01`` to ``2020-06-01``:

In [None]:
df.loc['2020-03-01': '2020-06-01'].query("Close < `Lower Band`").head()
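
The backticks inside the query string are what let ``.query()`` refer to a column whose name contains a space. A minimal sketch on invented data (``toy`` and its values are assumptions for the example), alongside the equivalent boolean-mask filter:

```python
import pandas as pd

# Invented values; only the column names mirror the lab's DataFrame.
toy = pd.DataFrame(
    {"Close": [10.0, 8.0, 12.0], "Lower Band": [9.0, 9.0, 9.0]},
    index=pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-04"]),
)

# Backticks let query() refer to a column whose name contains a space.
below = toy.query("Close < `Lower Band`")

# The same filter written as a boolean mask.
below_mask = toy[toy["Close"] < toy["Lower Band"]]

print(below)
```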

And we can also zoom in on that period:

In [None]:
df.loc['2020-01-01': '2020-06-01', ['Close', 'Lower Band', 'Upper Band']].plot(figsize=(14, 7), title='Close Price & its SMA | 2020-01-01 to 2020-06-01');

# Series Practice with World Bank's data

## Introduction

In this lab, you will use pandas to explore the World Bank's data on economic, political, and social indicators for countries around the world. The data is collected from Kaggle.

The World Bank data is organized into a number of different categories, including:

- Economy: This category includes data on GDP, population, inflation, and unemployment.
- Government: This category includes data on government spending, taxes, and debt.
- Social: This category includes data on education, health, and poverty.

The data is stored in a CSV file named ``world_data.csv``. In this lab, you will learn how to:

- Create pandas series
- Inspect basic Series attributes, such as shape, size, and data type
- Access data in pandas series
- Perform basic statistical operations on pandas series

By the end of this lab, you will be able to use pandas to explore and analyze the World Bank's data on economic, political, and social indicators for countries around the world.

> ***Run all the cells that are under Take a look at raw data heading in the notebook.***

Let's dive into the lab!

In [None]:
import pandas as pd
df = pd.read_csv('world_data.csv')
df.head()

In [None]:
df.columns

Creating a pandas series from a dataframe df

In [None]:
# Converting columns to pandas series
country_name = pd.Series(df['Country Name'])
country_code = pd.Series(df['Country Code'])
population = pd.Series(df[' Population, total '])
gdp = pd.Series(df['GDP, PPP (current international $)'])
internet_users = pd.Series(df['Internet users (per 100 people)'])
life_expectancy = pd.Series(df['2014 Life expectancy at birth, total (years)'])
literacy_rate = pd.Series(df['Literacy rate, adult female (% of females ages 15 and above)'])
exports = pd.Series(df['Exports of goods and services (% of GDP)'])

In [None]:
country_name.head()

In [None]:
country_code.head()

In [None]:
population.head()

In [None]:
gdp.head()

In [None]:
internet_users.head()

In [None]:
life_expectancy.head()

In [None]:
literacy_rate.head()

In [None]:
exports.head()

1. What is the data type of the ``country_name`` series

Since the result is ``dtype('O')``, we know the data type is **object**.

In [None]:
country_name.dtype

2. What is the ``size`` of the gdp series

In [None]:
gdp.size

3. What is the data type of the ``internet_users`` series

In [None]:
internet_users.dtype

4. What is the value of the first element in the ``population`` series

In [None]:
population.iloc[0]

5. What is the value of the last element in the ``life_expectancy`` series

In [None]:
life_expectancy.iloc[-1]

6. What is the value of the element with index 29 in the ``literacy_rate`` series

In [None]:
literacy_rate.iloc[29]

7. What is the value of the last element in the ``gdp`` series

In [None]:
gdp.iloc[-1]

8. What is the mean of the ``internet_users`` series

In [None]:
internet_users.mean()

9. What is the standard deviation of the ``internet_users`` series

In [None]:
internet_users.std()

10. What is the median of the ``exports`` series

In [None]:
exports.median()

11. What is the minimum value in the ``life_expectancy`` series

In [None]:
life_expectancy.min()

12. What is the ``average`` literacy rate of all countries

In [None]:
literacy_rate.mean()

13. Sort the ``country_name`` series in ascending order

In [None]:
country_name_sorted = country_name.sort_values(ascending=True)

14. Sort multiple series at once

    Both the series ``country_name`` and ``literacy_rate`` have the same number of elements and the elements are in the same order with respect to index number. Arrange the country name as per ascending order of literacy rate. Assign the result of country name to new variable called ``country_name_sorted_by_literacy_rate`` and the result of literacy rate to new variable called ``literacy_rate_sorted``.

    Example: If the country name is ``['India', 'China', 'Japan']`` and literacy rate is ``[80, 90, 70]``, then the result should be ``['Japan', 'India', 'China']`` and ``[70, 80, 90]``.

In [None]:
literacy_rate_sorted = literacy_rate.sort_values(ascending=True)
country_name_sorted_by_literacy_rate = country_name.loc[literacy_rate_sorted.index]
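
The same idea, run on the three-country example from the activity text: sort one series, then reuse the index order of the sorted result to reorder the other. The variable names here are shortened stand-ins for the ones in the lab.

```python
import pandas as pd

country = pd.Series(["India", "China", "Japan"])
rate = pd.Series([80, 90, 70])

rate_sorted = rate.sort_values()                   # index order becomes 2, 0, 1
country_by_rate = country.loc[rate_sorted.index]   # reuse that order for the names

print(country_by_rate.tolist())  # ['Japan', 'India', 'China']
print(rate_sorted.tolist())      # [70, 80, 90]
```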

# Intro to Pandas Series

## Intro

That is a lot to unpack, so let's use an example instead. Take a look at the following "table" that contains a list of Top Companies (in technology) and their revenue (in millions of dollars):


A pandas Series will help us represent that data. Now it's time to turn on the lab and head to the Notebook, where we'll see how Series work.

The syntax to create a series is:

```python
import pandas as pd
pd.Series(data, index, name="A name")
```

The main components of a ``Series`` are:

- **data**: this is the data that we want to represent, and obviously, we could say the "most important" component of the series. In our example, the data is the revenue of the companies.

- **index**: the index indicates the "labels" of the data we're storing. We'll use the index to "reference" the data later. Indices are not required; pandas will assign a default sequential index if we don't provide one.

- **name**: a series can contain a "name"; this will make more sense when we start using DataFrames. For now, just think about it as extra "documentation"; more clarity when working with your code. *Names are optional*.

Finally, it's important to note that Series are "strongly typed": every Series has an associated, **enforced data type**. It's not like Python dictionaries, where we can freely mix types. In this case, you'll see that the series is of type int64 (it says dtype: int64 after the name, at the bottom of the representation). Don't worry too much about it for now; it's basically a Series containing "integers".
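
A short sketch of the three components, using invented values. It shows the default sequential index, the optional name, and how the ``dtype`` follows from the data: a Series always has a single dtype, and when the values are of mixed kinds pandas falls back to the generic ``object`` dtype.

```python
import pandas as pd

plain = pd.Series([10, 20, 30])    # no index and no name given
print(plain.index.tolist())        # [0, 1, 2]
print(plain.name)                  # None

labelled = pd.Series([10, 20, 30], index=["a", "b", "c"], name="Example")
print(labelled.loc["b"])           # 20

floats = pd.Series([1.5, 2.5])
mixed = pd.Series([1, "two"])
print(floats.dtype)                # float64
print(mixed.dtype)                 # object
```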

We'll represent them using a Series in the following way:

In [None]:
companies = [
    'Apple', 'Samsung', 'Alphabet', 'Foxconn',
    'Microsoft', 'Huawei', 'Dell Technologies',
    'Meta', 'Sony', 'Hitachi', 'Intel',
    'IBM', 'Tencent', 'Panasonic'
]

s = pd.Series([
    274515, 200734, 182527, 181945, 143015,
    129184, 92224, 85965, 84893, 82345,
    77867, 73620, 69864, 63191],
    index=companies,
    name="Top Technology Companies by Revenue")

s

1. **Check your knowledge: create a series**

    Create a series under the variable ``my_series`` that contains three elements ``9``, ``11`` and ``-5``. The index of the series should be ``['a', 'b', 'c']`` and the name should be ``"My First Series"``.


In [None]:
import pandas as pd

In [None]:
elements = [9,11,-5]
index = ['a', 'b', 'c']
name = 'My First Series'
my_series = pd.Series(elements, index=index, name=name)

In [None]:
my_series

## Basic selection and location

Series are very flexible about querying/selecting data. You can get data by index label (get the revenue of Apple), by position (get the 5th element), and also by several labels or positions at once.

### Selecting by index

We use the Series' index to reference and locate the data associated with a given label.

For example, to get the revenue of *Apple*, we can do: ``s["Apple"]``. That works, as you can see in the notebook. But you'll also see that we use a ``.loc`` attribute, making it: ``s.loc["Apple"]``. This is the preferred method to reference values. It might make little sense for now, but it will once we start dealing with DataFrames.

In [None]:
s['Apple']

``.loc`` is the preferred way:

In [None]:
s.loc['Apple']

### Selecting by position

We can also select elements by their "order". After all, as we mentioned in the previous section, **Series are ordered data structures**. So we can select an element by its position: for example, the "first", "last", "third", or "253rd" element. To select an element by its position, we use the ``.iloc`` attribute. The beauty of ``.iloc`` is that, like indexing in Python lists, it accepts negative numbers to reference elements from the end of the series. That means that ``.iloc[-1]`` returns the **LAST** element in the series.

In [None]:
s.iloc[0]

In [None]:
s.iloc[-1]

### Errors in selection


As expected, if you try to retrieve an element that doesn't exist, you'll get an error. This works much like Python dictionaries and lists: selecting by index (``.loc``) fails with a ``KeyError`` (like dictionaries), and selecting by position (``.iloc``) fails with an ``IndexError`` (like lists).

For label lookups, you can usually prevent these errors with the membership operator ``in``, which checks whether a given label is part of the index.

In [None]:
# this code will fail
s.loc["Non existent company"]

In [None]:
# This code also fails: position 132 is out of bounds
# (there are not that many elements in the Series)
s.iloc[132]

We could prevent these errors using the membership check ``in``:

In [None]:
'Apple' in s

In [None]:
'Snapchat' in s
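
One detail worth keeping in mind is that ``in`` looks at the index labels, not at the stored values. The sketch below uses two entries from our series; ``revenue_or_none`` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

s = pd.Series([274515, 77867], index=["Apple", "Intel"])

# `in` tests the index labels, not the stored values.
print("Apple" in s)   # True
print(274515 in s)    # False


def revenue_or_none(series, label):
    """Return the value stored under label, or None when the label is absent."""
    return series.loc[label] if label in series else None


print(revenue_or_none(s, "Intel"))     # 77867
print(revenue_or_none(s, "Snapchat"))  # None

# Positions have no labels to test, so compare against the length instead.
position = 132
print(position < len(s))  # False
```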

### Multiple selection

So far, Series look like glorified dictionaries. But this single feature will set them apart.

> **With both index selection and positional selection, you can pass multiple elements to be returned. This is extremely convenient.**

Pay attention to the value returned: another Series, a "sub-series," we could say, only with the values requested. In Pandas you'll see this pattern everywhere: Series selection returns other series, DataFrames selection (in future lessons) returns other DataFrames or other Series, etc.

Let's see it in action. To select several elements (by index/label), we just pass a list of the labels:

```py
s.loc[["Apple", "Intel", "Sony"]]
```

To select multiple values by position, we also pass a list with the positions:

```py
s.iloc[[0, 5, -1]]
```

In [None]:
s.loc[['Apple', 'Intel', 'Sony']]

In [None]:
s.iloc[[0, 5, -1]]

## Activities

2. **Check your knowledge: location by index**

    Select the revenue of ``Intel`` and store it in a variable named ``intel_revenue``:

In [None]:
intel_revenue = s.loc['Intel']

3. **Check your knowledge: location by position**

    Select the revenue of the "second to last" element in our series ``s`` and store it in a variable named ``second_to_last``:

In [None]:
second_to_last = s.iloc[-2]

4. **Check your knowledge: multiple selection**

    Use multiple label selection to retrieve the revenues of the companies:

    - Samsung
    - Dell Technologies
    - Panasonic
    - Microsoft

In [None]:
sub_series = s.loc[["Samsung", 'Dell Technologies', "Panasonic", 'Microsoft']]
sub_series

## Series Attributes and Methods

Series contain a lot of useful attributes and methods to interact with them. Probably the two most common ones you'll see all the time are ``.head()`` and ``.tail()``. These return 5 elements by default, either from the beginning of the series (``.head()``) or from the end of it (``.tail()``). This is useful when you're working with real data (possibly MILLIONS of values). You can also pass the number of elements to return: ``.head(3)`` and ``.tail(2)``.

In [None]:
s.head()

In [None]:
s.tail()
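
A quick sketch of passing the number of elements explicitly, on a stand-in series of ten numbers (``numbers`` is a made-up name for this example):

```python
import pandas as pd

numbers = pd.Series(range(10))

print(numbers.head(3).tolist())  # [0, 1, 2]
print(numbers.tail(2).tolist())  # [8, 9]
print(len(numbers.head()))       # 5
```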

### Main attributes

Once a series is constructed (somehow), we can access all the attributes separately. Namely:

- The data of the series: using the ``.values`` attribute
- The index: using ``.index``
- The name: using ``.name``
- The type assigned: using ``.dtype``
- The number of elements: using ``.size``

In [None]:
s.values

In [None]:
s.index

In [None]:
s.name

In [None]:
s.dtype

In [None]:
s.size

``len`` also works

In [None]:
len(s)

### Statistical methods

But that's not all about attributes and Series. As you might already know, we use Pandas for data processing. And a significant component of data processing is understanding its statistical implications.

The ``.describe()`` method gives you quick summary statistics of your series.

There are also individual methods for each of the values returned by ``.describe()``: ``.max()``, ``.min()``, ``.mean()``, ``.median()``, etc.

There's also a ``quantile()`` method to check for specific quantiles (or percentiles). For example, to get the 75th percentile, you can use: ``s.quantile(.75)``.
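
Here is a minimal sketch of these statistical methods on a small invented series, chosen so the results are easy to verify by hand:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

summary = values.describe()   # a Series with count, mean, std, min, quartiles, max
print(summary["mean"])        # 3.0
print(values.median())        # 3.0
print(values.quantile(0.75))  # 4.0
print(summary["75%"])         # 4.0
```

Note that the ``75%`` row of ``.describe()`` and ``.quantile(0.75)`` report the same number.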

## Activities

In [None]:
# Run this cell to complete the activity
american_companies = s[[
    'Meta', 'IBM', 'Microsoft',
    'Dell Technologies', 'Apple', 'Intel', 'Alphabet'
]]
american_companies

We have selected a "sub-series" of only American companies in the variable ``american_companies``. Using that Series, complete the following activities.

5. **What's the average revenue of American Companies?**

    What's the average revenue of the companies contained in the variable ``american_companies``? Enter the whole number (that is, without decimals).

In [None]:
american_companies.mean()

6. **What's the median revenue of American Companies?**

In [None]:
american_companies.median()

## Sorting Series

Sorting series is extremely simple. This is another great feature of pandas in general.

But with Sorting, we'll introduce two important concepts:

### Sorting by values or Index

First, what are we sorting by? The values? Or the index? Well, we'll be able to sort by both attributes: using the ``.sort_values()`` and ``.sort_index()`` methods.

Check the examples in the notebook. To sort the values of the series (that is, the revenue), we use the ``.sort_values()`` method. To sort the series by its index (in this case, lexicographically by the company's name), we use the ``.sort_index()`` method. The default sort order is ascending. To sort in descending order, you must pass the ``ascending=False`` parameter (to either method).
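
As a small sketch, here are both methods applied to the first three companies of our series:

```python
import pandas as pd

s = pd.Series([274515, 200734, 182527], index=["Apple", "Samsung", "Alphabet"])

by_value = s.sort_values()                      # ascending by revenue
by_value_desc = s.sort_values(ascending=False)  # descending by revenue
by_label = s.sort_index()                       # alphabetical by company name

print(by_value.index.tolist())       # ['Alphabet', 'Samsung', 'Apple']
print(by_value_desc.index.tolist())  # ['Apple', 'Samsung', 'Alphabet']
print(by_label.index.tolist())       # ['Alphabet', 'Apple', 'Samsung']
```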

7. **What company has the largest revenue?**

    Using all the companies (stored in the Series in ``s``), which company has the largest revenue?

In [None]:
s.sort_values(ascending=False)[:1]

8. **Sort company names lexicographically. Which one comes first?**

    Using all the companies (stored in the Series ``s``), which name is the "first" one in lexicographic (*or alphabetical*) order? That is, ``aa`` comes before ``ab``.

In [None]:
s.sort_index(ascending=True)[:1]

## Immutability

The second important concept is **immutability**, and this is NOT just a Series concept; it's a widespread concept in pandas and Data Science in general. In this case, you'll see that when we "sort a series", **we don't ACTUALLY sort the series itself**. There's a NEW series returned. The underlying series has NOT changed; it has NOT been mutated.

This is a CRUCIAL concept in Data Science in general. We don't want to change/mutate things, as it's harder to keep track of these changes.

If, by any chance, you DO want to mutate your series, that is, sort it and alter the underlying series (``s`` in this case), you must pass the ``inplace=True`` parameter. When doing so, you'll see that the method doesn't return anything, but the underlying series (in ``s``) has changed to contain the data in the requested order.

Again, immutability is both preferred and encouraged, so try to use immutable methods as much as possible. For example, it is fine to create a second variable with the values sorted (``s_sorted_values``) and without changing ``s``.
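
The difference between the two styles can be sketched on a three-element series with invented values:

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])

s_sorted_values = s.sort_values()  # returns a new Series; s keeps its order
print(s.tolist())                  # [3, 1, 2]
print(s_sorted_values.tolist())    # [1, 2, 3]

result = s.sort_values(inplace=True)  # mutates s and returns None
print(result)                         # None
print(s.tolist())                     # [1, 2, 3]
```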

In [None]:
s.head()

We will sort the series by revenue, ascending, and we'll mutate the original one. Notice how the method doesn't return anything:

In [None]:
s.sort_values(inplace=True)

But now the series is sorted by revenue in ascending order:

In [None]:
s.head()

We'll now sort the series by index, mutating it again:

In [None]:
s.sort_index(inplace=True)

In [None]:
s.head()

## Activities

9. **Sort American Companies by Revenue**

    Create a new variable ``american_companies_desc`` that contains the results of sorting ``american_companies`` by revenue (this is, by value) in descending order.

In [None]:
# Run this cell to complete the activity
american_companies = s[[
    'Meta', 'IBM', 'Microsoft',
    'Dell Technologies', 'Apple', 'Intel', 'Alphabet'
]]

In [None]:
american_companies_desc = american_companies.sort_values(ascending=False)
american_companies_desc

10. **Sort (and mutate) international companies**
    
    Now it's time to do what we told you NOT to do, but we need to practice it. There's a new series defined named ``international_companies``. Your task is to sort it by revenue in **descending order** (larger to smaller), doing it in place, that is, modifying the series.

    If you make a mistake, you can always re-run the cell that generates the Series.

In [None]:
# Run this cell to complete the activity
international_companies = s[[
    "Sony", "Tencent", "Panasonic",
    "Samsung", "Hitachi", "Foxconn", "Huawei"
]]
international_companies

In [None]:
international_companies.sort_values(ascending=False, inplace=True)
international_companies.head()

## Modifying Series

Modifying series is something we rarely want to do. As mentioned in the previous section, we try to be "immutable", so changing series is usually not recommended.

But still, it's possible to modify series by changing values, adding or removing elements. This works in the same way as with Python dictionaries.

For example, to modify an existing value, we can just overwrite it. Let's say we want to set IBM's revenue to $0. We can just do:

```py
s['IBM'] = 0
```
To add elements (or change the value of an element), you can just use the index of the new element: ``s['Tesla'] = 21450``.

To remove an element, we use the ``del`` keyword and the index: ``del s["Apple"]``.

Again, these are the same ways we use to add/remove elements from dictionaries.

Modifying values:

In [None]:
s['IBM']  = 0

In [None]:
s.sort_values().head()

Adding elements:

In [None]:
s['Tesla'] = 21450

In [None]:
s.sort_values().head()

11. **Insert Amazon's Revenue**

    Insert a new element in our series ``s``, Amazon with a total revenue of: ``$469,822`` (million dollars).

In [None]:
s['Amazon'] = 469822

12. **Delete the revenue of Meta**

    Remove the entry for Meta from the series ``s``.

In [None]:
del s['Meta']
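
If you'd rather stay on the immutable side, one alternative to ``del`` (not used in this lab, so treat it as an optional aside) is the ``.drop()`` method, which returns a new series and leaves the original alone. A sketch with two entries from our series:

```python
import pandas as pd

s = pd.Series([85965, 73620], index=["Meta", "IBM"])

without_meta = s.drop("Meta")  # new Series without the label; s is untouched
print("Meta" in s)             # True
print("Meta" in without_meta)  # False

del s["Meta"]                  # mutates s in place
print("Meta" in s)             # False
```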

## Concatenating Series (immutable)

Finally, if you want to "concatenate" two series, you can use the top-level ``pd.concat()`` function, passing the objects in a list: ``pd.concat([series1, series2])`` (it works for DataFrames too), as shown in the example in the notebook. The function returns a new series or DataFrame with the values concatenated; the originals are not modified.

In [None]:
another_s = pd.Series([21_450, 4_120], index=['Tesla', 'Snapchat'])

In [None]:
another_s

In [None]:
s_new = pd.concat([s, another_s])

In [None]:
s

In [None]:
s_new
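
One thing to watch for: ``pd.concat()`` does not merge labels, so a label present in both inputs (like ``Tesla``, which we added to ``s`` earlier) appears twice in the result. The sketch below reproduces that on a reduced version of ``s``; the ``verify_integrity=True`` option is shown as one possible way to get an error instead of silent duplicates.

```python
import pandas as pd

s = pd.Series([0, 21450], index=["IBM", "Tesla"])
another_s = pd.Series([21450, 4120], index=["Tesla", "Snapchat"])

s_new = pd.concat([s, another_s])
print(len(s), len(s_new))     # 2 4
print(s_new.index.is_unique)  # False

# Ask pandas to reject overlapping labels instead of duplicating them.
try:
    pd.concat([s, another_s], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
print(duplicate_detected)     # True
```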

# Series Practice with S&P Companies Market Cap

## Introduction

Now it's time to put all you've learned about series to a test. Let's start by introducing the data we'll be working with. Make sure you've started your lab, and the Notebook is on the right panel.

For this project, we'll be working with the "market capitalization" of S&P500 (short for "Standard and Poor's 500") companies. The S&P 500 is a free-float, capitalization-weighted index of the top 500 publicly listed stocks in the US (top 500 by market cap). To put it simply: a list of the "most valuable companies in US Markets".

> **Disclaimer: the data is outdated. As you might know, markets change very rapidly.**

But this project has a twist. Instead of one series, we will be using two, read from two different datasets. The first one contains the stock symbols of the companies. For example, Apple Inc.'s stock symbol is ``AAPL`` (usually styled $AAPL). Facebook's symbol is ``FB`` ($FB).

The second dataset contains the market cap of each company by its symbol. For example, the market cap of ``AAPL`` (Apple Inc.) is ``$809,508,034,020``.

The first thing you'll see in the Notebook is a preview of the underlying datasets we're using: ``sp500-symbols.csv`` and ``sp500-marketcap.csv``. Next, we use the ``head`` Linux command to peek at each file's first five lines.

We then import pandas (``import pandas as pd``) and load the data into series using the ``read_csv`` method. Don't worry about it yet! We'll use A LOT of ``read_csv`` during this track, so you'll get pretty used to it soon.

At the end of those operations, you'll have two series containing the data we'll be working with. One is ``market_cap`` (which contains the market cap by symbol), and the other is ``symbols``, which contains the stock symbol of each company, indexed by company name.

Take a few minutes to familiarize yourself with both Series, and then let's get started!


In [None]:
import pandas as pd

In [None]:
market_cap = pd.read_csv("sp500-marketcap.csv", index_col="Symbol")['Market Cap']
market_cap.head()

In [None]:
symbols = pd.read_csv("sp500-symbols.csv", index_col="Name")['Symbol']
symbols.head()

## Basic Series Attributes

We'll start by doing a simple *reconnaissance* of the series we're working with.

1. **Name of the market_cap Series**

    What's the name of the series contained in the ``market_cap`` variable?

In [None]:
market_cap.name

2. **Name of the symbols Series**

    What's the name of the series contained in the ``symbols`` variable?

In [None]:
symbols.name

3. **What's the dtype of ``market_cap``**
    
    What's the dtype of the series contained in the ``market_cap`` variable?

In [None]:
market_cap.dtype

4. **What's the dtype of ``symbols``**

    What's the dtype of the series contained in the symbols variable?

In [None]:
symbols.dtype

5. **How many elements do the series have?**

    How many elements does the ``market_cap`` series contain?

In [None]:
len(market_cap)

6. **What's the minimum value for Market Cap?**

In [None]:
market_cap.min()

7. **What's the maximum value for Market Cap?**

In [None]:
market_cap.max()


8. **What's the average Market Cap?**

    Find the average value for Market Cap, and enter it WITHOUT decimals. Just the integer number (if you find the average is 1948.88, just enter 1948).

In [None]:
market_cap.mean()

9. **What's the median Market Cap?**

    Find the median value for Market Cap, and enter it WITHOUT decimals. Just the integer number (if you find the median is 1948.0, just enter 1948).

In [None]:
market_cap.median()

## Selection and Indexing

Now it's time to practice some selection and indexing using Series. We'll start with some basic activities with each series, and by the end we'll be using both of them.

10. **What's the symbol of ``Oracle Corp.``?**

In [None]:
symbols.loc['Oracle Corp.']

11. **What's the Market Cap of ``Oracle Corp.``?**

In [None]:
market_cap.loc['ORCL']

12. **What's the Market Cap of ``Wal-Mart Stores``?**

In [None]:
market_cap.loc[symbols.loc['Wal-Mart Stores']]

13. **What's the symbol of the 129th company?**

In [None]:
symbols.iloc[128]

14. **What's the Market Cap of the 88th company in ``symbols``?**

Warning! The companies might be out of order... so the 88th company in ``symbols`` might not be the same as the 88th one in ``market_cap``. We need you to find the 88th company in ``symbols`` first, and then get the Market Cap from ``market_cap`` for that particular symbol.

In [None]:
market_cap.loc[symbols.iloc[87]]

15. **Create a new series only with FAANG Stocks**

    There's a common term in investing (and in tech): FAANG companies. ``FAANG`` is an acronym for five "big tech" companies: Facebook, Apple, Amazon, Netflix, and Google (read more about FAANG and Big Tech on Wikipedia).

    > FAANG is just an acronym for a few companies; there are other big tech companies, like Microsoft. So the term FAANG is not a strict definition of big tech.

    Your task is to create a new series, under the variable ``faang_market_cap``, containing the market cap of the following companies:

    - ``Amazon.com Inc``
    - ``Apple Inc.``
    - ``Microsoft Corp.``
    - ``Alphabet Inc Class A`` (this is Google's main stock)
    - ``Facebook, Inc.``
    - ``Netflix Inc.``

    **Important**! The stocks must be in THIS order. You will need to find the Symbols of the companies first.

    Also important, as stated above, you MUST create a variable containing your new series. Your code should look something like:

    ``faang_market_cap = ... # your code``
    
    There's a way to combine everything in a one-liner. Try to solve this task without looking at the solution; but after you've finished it, take a peek at it because there's a neat trick explained at the end of the solution.

In [None]:
market_cap.sort_values(ascending=False).head(8)

In [None]:
faang_market_cap = pd.Series([market_cap.loc[symbols['Amazon.com Inc']],
                              market_cap.loc[symbols['Apple Inc.']],
                              market_cap.loc[symbols['Microsoft Corp.']],
                              market_cap.loc[symbols['Alphabet Inc Class A']],
                              market_cap.loc[symbols['Facebook, Inc.']],
                              market_cap.loc[symbols['Netflix Inc.']]
                              ], index= [symbols['Amazon.com Inc'],
                                         symbols['Apple Inc.'],
                                         symbols['Microsoft Corp.'],
                                         symbols['Alphabet Inc Class A'],
                                         symbols['Facebook, Inc.'],
                                         symbols['Netflix Inc.']])

In [None]:
faang_market_cap

One neat trick with Pandas is that we can use the values of one series to select elements from another series. So we could have just done:

In [None]:
faang_market_cap2 = market_cap[symbols[["Amazon.com Inc", "Apple Inc.", "Microsoft Corp.", "Alphabet Inc Class A", "Facebook, Inc.", "Netflix Inc.", ]]]

In [None]:
faang_market_cap2
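The trick above can be sketched on a tiny pair of series. The company names, tickers, and numbers below are made up for illustration and are not taken from the lab's dataset:

```python
import pandas as pd

# Hypothetical data: company name -> ticker, and ticker -> market cap.
symbols_demo = pd.Series(
    ["AAA", "BBB", "CCC"],
    index=["Alpha Inc.", "Beta Corp.", "Gamma Ltd."],
)
market_cap_demo = pd.Series([30.0, 10.0, 20.0], index=["CCC", "AAA", "BBB"])

# Step 1: selecting with a list of names returns a Series of tickers.
tickers = symbols_demo.loc[["Gamma Ltd.", "Alpha Inc."]]

# Step 2: the *values* of that Series are used as labels in the second one,
# so the result follows the order of the requested names.
selected = market_cap_demo.loc[tickers]
print(selected)
```

The order of the result comes from the order of the names in the list, not from the order in which the tickers are stored in ``market_cap_demo``.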

16. **Select the market cap of companies in position 1st, 100th, 200th, etc.**

    The S&P500 index contains 500 companies. Create a variable ``position_companies`` that contains the market cap of the companies in the positions:

    - 1st
    - 100th
    - 200th
    - 300th
    - 400th
    - 500th
    
    **Important!** This selection should be done under ``market_cap``. Don't use ``symbols`` for this particular activity.

In [None]:
position_companies = market_cap.iloc[[0,99,199,299,399,499]]
position_companies

## Sorting Series

17. **What's the 4th company sorted lexicographically by their symbol?**

    Use the ``symbols`` series to sort **the symbols** in lexicographical order (ascending). Which company (the name, the index value) appears in the 4th position? Note: the answer is the full company name. For example, the full name of ``MSFT`` (Microsoft) is ``Microsoft Corp.``, as it appears in the index, so in that case the answer would be ``Microsoft Corp.``. By the way, Microsoft is definitely NOT the correct answer.

In [None]:
symbols.sort_values(ascending=True).head()

18. **What's the Market Cap of the 7th company (in descending order)?**

Using the ``market_cap`` series, sort the companies by their symbol in lexicographical order in **descending mode** and enter the Market Cap of the 7th company.

In [None]:
market_cap.sort_index(ascending=False)[:7]
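The two sorting activities use different methods because the symbols live in different places: in ``symbols`` they are the values, while in ``market_cap`` they are the index. A minimal sketch of the difference, using a made-up series:

```python
import pandas as pd

# Hypothetical series: index = ticker, values = market cap.
caps = pd.Series([5.0, 9.0, 1.0], index=["BBB", "AAA", "CCC"])

by_value = caps.sort_values(ascending=False)  # orders by the numbers
by_label = caps.sort_index(ascending=False)   # orders by the tickers

print(by_value.index.tolist())
print(by_label.index.tolist())
```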

# Practice Series Filtering

## Introduction

In this lab, we will practice filtering with conditionals and sorting on pandas series using a dataset that contains information about international cricket players who have played since 2002. The data includes the player's name, number of innings they have played, number of runs they have scored, number of balls they have faced, number of times they have been dismissed, their batting average, their strike rate, their highest score, number of fours they have hit, number of sixes they have hit, number of times they have scored a half-century, and number of times they have scored a century.

Below are the columns of the dataset:

- Player: Name of the player
- I: Number of innings played
- R: Number of runs scored
- B: Number of balls faced
- Outs: Number of times dismissed
- Avg: Batting average
- SR: Strike rate
- HS: Highest score
- 4s: Number of fours hit
- 6s: Number of sixes hit
- 50: Number of times scored a half-century
- 100: Number of times scored a century

Let's get started with the lab now!

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv("leadersdata.csv")
data

In [None]:
data.columns

In [None]:
data.set_index('Player', inplace=True)

In [None]:
# Creating pandas series for each column
innings = data['I']
runs = data['R']
balls = data['B']
outs = data['Outs']
batting_average = data['Avg']
strike_rate = data['SR']
highest_score = data['HS']
number_of_fours = data['4s']
number_of_sixes = data['6s']
number_of_fifties = data['50']
number_of_hundreds = data['100']

In [None]:
# Printing the first 5 rows of each series
print("Innings:\n", innings.head())
print("Runs:\n", runs.head())
print("Balls:\n", balls.head())
print("Outs:\n", outs.head())
print("Batting Average:\n", batting_average.head())
print("Strike Rate:\n", strike_rate.head())
print("Highest Score:\n", highest_score.head())
print("Number of Fours:\n", number_of_fours.head())
print("Number of Sixes:\n", number_of_sixes.head())
print("Number of Fifties:\n", number_of_fifties.head())
print("Number of Hundreds:\n", number_of_hundreds.head())

## Activities

1. **How many players have a batting average greater than 30 in the ``batting_average`` series**

In [None]:
batting_average.loc[lambda x : x > 30].shape

In [None]:
len(batting_average[batting_average > 30])
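Both solution cells count the same thing. As a sketch on made-up numbers (the player labels are hypothetical), here are three equivalent ways to count the matches; the last one relies on ``True`` counting as 1 when a boolean series is summed:

```python
import pandas as pd

avg_demo = pd.Series([12.5, 31.0, 45.2, 30.0], index=["P1", "P2", "P3", "P4"])

mask = avg_demo > 30                                    # boolean series
count_len = len(avg_demo[mask])                         # filter, then count
count_shape = avg_demo.loc[lambda x: x > 30].shape[0]   # callable inside .loc
count_sum = int(mask.sum())                             # sum the True values

print(count_len, count_shape, count_sum)
```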

2. **What is the maximum number of runs scored by a player in the ``runs`` series**

In [None]:
runs.max()

3. **Name the player with maximum runs**

    Write the name of the player who has scored the maximum number of runs.

In [None]:
runs.sort_values(ascending=False)[:1]

In [None]:
runs[runs == runs.max()]
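Another option worth knowing is ``Series.idxmax()``, which returns the index label of the maximum directly. The sketch below uses made-up data with a tie to show how it differs from the boolean filter:

```python
import pandas as pd

# Hypothetical data: two players share the maximum.
runs_demo = pd.Series([120, 450, 450, 80], index=["P1", "P2", "P3", "P4"])

top_label = runs_demo.idxmax()                      # label of the FIRST maximum
all_top = runs_demo[runs_demo == runs_demo.max()]   # every row tied at the top

print(top_label)
print(all_top)
```

When ties matter, as in the next activity, the boolean filter is the safer choice because it keeps every matching player.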

4. **Name the player who played least number of balls**

    There is a possibility that more than one player has played the least number of balls. In that case, write the names of the first and last of those players, separated by a comma. For example, write ``A, B``.

In [None]:
balls[balls == balls.min()]

5. **How many players have played more than 500 balls in the ``balls`` series**

In [None]:
len(balls[balls > 500])

6. **What is the mean value of the batting_average series**

    Write your answer in the form of a number with 2 decimal places. For example, if the answer is 1.234567, write 1.23.

In [None]:
batting_average.mean()

7. **How many players have a strike rate not equal to 70 in the ``strike_rate`` series**

In [None]:
len(strike_rate[strike_rate != 70])

8. **What is the minimum number of innings played by a player in the ``innings`` series**

In [None]:
innings.min()

9. **How many players have a batting average greater than 50 in the ``batting_average`` series**

In [None]:
len(batting_average[batting_average > 50])

10. **How many players have a batting average between 20 and 30 (inclusive) in the ``batting_average`` series**

In [None]:
len(batting_average.loc[lambda x : (x >= 20) & (x <= 30)])

11. **Calculating the Average Balls Faced by a Player**

    The ``balls`` series contains information about the number of balls faced by different players in a cricket match. The task is to calculate the average number of balls faced by a player.

    Round off the result to two decimal places. For example, if the answer is 123.456789, write 123.46.

In [None]:
balls.mean()

12. **How many players have a strike rate greater than 120 in the ``strike_rate`` series**

In [None]:
len(strike_rate[strike_rate > 120])

13. **Provide the names of the top three players from the ``strike_rate`` series**

    Write the names of the players in the decreasing order of their strike rate separated by a comma. For example, write ``A, B, C``.

In [None]:
strike_rate.sort_values(ascending=False).head(3)

14. **Sum of Maximums from ``number_of_fours`` and ``number_of_sixes`` Series**

    The goal is to calculate the sum of the maximum values from both series combined. For example, if the maximum value in ``number_of_fours`` is ``10`` and the maximum value in ``number_of_sixes`` is ``20``, then the answer is ``30``.

In [None]:
number_of_fours.max() + number_of_sixes.max()

15. **How many players have a batting average below ``10`` in the ``batting_average`` series**

In [None]:
len(batting_average[batting_average < 10])

16. **Name the player who hit maximum sixes**

    Write the player name along with the number of sixes hit by the player separated by a comma. For example, write ``A, 10``.

In [None]:
number_of_sixes[number_of_sixes == number_of_sixes.max()]

17. **How many players have a strike rate between 80 and 90 (inclusive) in the ``strike_rate`` series**

In [None]:
len(strike_rate[(strike_rate >= 80) & (strike_rate <= 90)])

18. **What is the total number of runs scored by all players in the ``runs`` series**

In [None]:
runs.sum()

19. **What is the range (difference between the maximum and minimum values) of the ``number_of_fifties`` series**

In [None]:
number_of_fifties.max() - number_of_fifties.min()

20. **How many players have a strike rate below 60 in the ``strike_rate`` series**

In [None]:
len(strike_rate[strike_rate < 60])

21. **Calculating the Mean Number of Boundaries (Fours + Sixes) Hit by a Player**

    In this activity, you will be calculating the mean number of boundaries hit by a player. Boundaries include both fours and sixes. You will be given two series: ``number_of_fours`` and ``number_of_sixes``. Your task is to find the mean number of boundaries hit by combining the values from both series.

    > Remember to round your answer to two decimal places. For example, if the mean number of boundaries is 1.453, write 1.45.

In [None]:
(number_of_fours + number_of_sixes).mean()

22. Players with highest score in ``highest_score`` series

    Create a new series named ``top_five_scores`` that contains the five players with the highest scores in the ``highest_score`` series. The series should be sorted in descending order based on the scores.

In [None]:
top_five_scores = highest_score.sort_values(ascending=False)[:5]

In [None]:
top_five_scores

## Filtering and Conditional Selection with Series

### Introduction

Now it's time to practice conditional selection with Series. We're going to use the same data as before with Companies' revenues.

Conditional Selection is like "filtering" or "querying" (if you're familiar with SQL). It'll allow us to answer the following types of questions:

- what companies made more than ``$X``?
- what companies made less than ``$X``?
- what companies made between ``$X`` and ``$Y``?

Turn on your lab, and let's get started!

### Boolean arrays

We're going to start introducing the concept of Boolean Arrays (which in turn, is a concept from NumPy, but you're not required to know NumPy to complete this).

This concept might sound a little bit strange at the beginning, but trust us, it'll all make sense in the next section.

A Boolean Array is a way of selecting in which we pass one value for **every element of the index** of the series, indicating which elements we want to select and which ones we want to skip. We indicate this by passing *Boolean* values: ``True`` and ``False``.

Let's see an example to make it more clear. We're going to use Boolean Arrays to select only American companies. That is: Apple, Alphabet, Microsoft, Dell, Meta, Intel and IBM.

Using Boolean Arrays, we need to pass the value ``True`` for each one of those companies, and ``False`` to all the remaining ones.

Check the example in the notebook to see it in action, but the syntax is basically:

```py
s.loc[[
    True,      # Apple
    False,     # Samsung
    True,      # Alphabet
    False,     # Foxconn
    True,      # Microsoft
    False,     # Huawei
    True,      # Dell
    True,      # Meta
    False,     # Sony
    False,     # Hitachi
    True,      # Intel
    True,      # IBM
    False,     # Tencent
    False,     # Panasonic
]]
```

Please note that the list (or array) of boolean values passed must be of the SAME length as the series index. We must pass a value, either ``True`` or ``False``, for ALL the values in the index. If we are not interested in selecting an element, we just pass ``False``.
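A minimal sketch of the length rule, using a made-up three-element series. In the pandas versions we are aware of, a boolean list of the wrong length raises an ``IndexError``; treat the exact exception type as an assumption rather than a guarantee:

```python
import pandas as pd

demo = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Correct: one boolean per element of the index.
picked = demo.loc[[True, False, True]]
print(picked)

# Too short: pandas refuses to guess the missing values.
try:
    demo.loc[[True, False]]
    error_raised = False
except IndexError as exc:
    error_raised = True
    print("Selection failed:", exc)
```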

Now, this might feel like an "inefficient" way of selecting data. What happens if you have 1 million records? Are you supposed to type ``True`` or ``False`` for each one of those records? Of course not! In the next section it'll become more clear why Boolean Arrays are important.

But first, it's your turn to practice Boolean Arrays and selection:

In [None]:
import pandas as pd

In [None]:
companies = [
    'Apple', 'Samsung', 'Alphabet', 'Foxconn',
    'Microsoft', 'Huawei', 'Dell Technologies',
    'Meta', 'Sony', 'Hitachi', 'Intel',
    'IBM', 'Tencent', 'Panasonic'
]

In [None]:
s = pd.Series([
    274515, 200734, 182527, 181945, 143015,
    129184, 92224, 85965, 84893, 82345,
    77867, 73620, 69864, 63191],
    index=companies,
    name="Top Technology Companies by Revenue")

In [None]:
s

### Boolean Arrays

In [None]:
s.loc[[
    True,      # Apple
    False,     # Samsung
    True,      # Alphabet
    False,     # Foxconn
    True,      # Microsoft
    False,     # Huawei
    True,      # Dell
    True,      # Meta
    False,     # Sony
    False,     # Hitachi
    True,      # Intel
    True,      # IBM
    False,     # Tencent
    False,     # Panasonic
]]

1. Select only the Japanese companies

    Create a Boolean Array that will select only the Japanese companies in our Series:

    - Sony
    - Hitachi
    - Panasonic
    
    Store the array in the variable ``japanese_boolean_array``.

    Using that same array, select the companies from the Series and store them in a different variable named ``japanese_companies``.

In [None]:
japanese_boolean_array = [
    False,     # Apple
    False,     # Samsung
    False,     # Alphabet
    False,     # Foxconn
    False,     # Microsoft
    False,     # Huawei
    False,     # Dell
    False,     # Meta
    True,      # Sony
    True,      # Hitachi
    False,     # Intel
    False,     # IBM
    False,     # Tencent
    True,      # Panasonic
]

In [None]:
japanese_companies = s.loc[japanese_boolean_array]
japanese_companies

### Conditional Selection

Now is when those Boolean Arrays will be really useful (and hopefully, finally click).

It turns out that Series accept comparison operators, like "greater than" (``>``), "less than" (``<``), etc. The interesting feature is that the result of applying any of these operators to a Series is a boolean array!

Let's see an example: in the notebook, you have an example of the expression ``s > 100_000``. This basically asks which values of the series are "greater than" 100,000 (which in turn means which companies' revenues are greater than $100 billion).

The result of that expression is the boolean *series*:

```py
Apple                 True
Samsung               True
Alphabet              True
Foxconn               True
Microsoft             True
Huawei                True
Dell Technologies    False
Meta                 False
Sony                 False
Hitachi              False
Intel                False
IBM                  False
Tencent              False
Panasonic            False
```

We can combine this "conditional" expression, with the selection method seen before to put together a very powerful filtering and querying system.

For example, let's ask:

> **Which companies' revenues exceed $100 billion?**

We just need to combine the ``.loc`` expression with our boolean array:

```py
s.loc[s > 100_000]
Apple        274515
Samsung      200734
Alphabet     182527
Foxconn      181945
Microsoft    143015
Huawei       129184
Name: Top Technology Companies by Revenue, dtype: int64
```

We can use any operator that we want: equals (``==``), different from (or not equals to ``!=``), greater than (``>``), greater than or equals to (``>=``), etc.
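As a quick sketch, here are a few of those operators applied to a three-company subset of the series (the values are copied from the lab's data, the variable names are ours):

```python
import pandas as pd

s_demo = pd.Series(
    [274515, 92224, 63191],
    index=["Apple", "Dell Technologies", "Panasonic"],
)

at_least = s_demo >= 92224             # boolean series, one value per company
not_dell = s_demo.loc[s_demo != 92224] # keep everything except the 92224 row
exactly = s_demo.loc[s_demo == 63191]  # keep only the exact match

print(at_least)
print(not_dell)
print(exactly)
```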

Give it a try, complete the following activities:

2. **Select companies with less than $90,000M in Revenue**

    Select those companies that have a revenue value less than ``90,000`` and store them in a new variable named ``less_90_rev``.

In [None]:
less_90_rev = s.loc[s < 90000]
less_90_rev

3. Select companies with revenue of more than $150,000M

    Select those companies that have a revenue value greater than ``150,000`` and store them in a new variable named ``more_150_rev``.

In [None]:
more_150_rev = s.loc[s > 150000]
more_150_rev

### Combining Series methods with comparison operators

This should feel natural based on what we saw in the previous section, but it's still worth mentioning. You can combine comparison operators with Series methods to obtain more generic expressions. For example, let's select the company with the MOST revenue:

``s.loc[s == s.max()]``

Or we could find those companies with revenue above the average:

``s.loc[s >= s.mean()]``

Or, a more complex expression could be finding those companies whose revenue is greater than the average + one standard deviation (these concepts are covered in our Descriptive Statistics track; don't worry about the technical details now):

``s.loc[s > (s.mean() + s.std())]``

#### Company with the most revenue

In [None]:
s.max()

In [None]:
s.loc[s == s.max()]

#### Company with revenue above average

In [None]:
s.mean()

In [None]:
s.loc[s >= s.mean()]

#### Companies whose revenue is greater than the average + 1 standard deviation

In [None]:
s.loc[s > (s.mean() + s.std())]

### Boolean operators

Boolean operators are the and, or, not expressions used to "concatenate" conditions. They should be familiar from your Python background. In Pandas, we also have boolean operators that we can use to create more "complex" selection expressions, but they're not ``and``, ``or``, ``not`` as in Python, they are:

- ``&`` for AND
- ``|`` for OR
- ``~`` for NOT
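To see why the Python keywords don't work here, consider this small sketch on a made-up series. ``and`` asks Python for a single truth value of the whole Series, which pandas refuses to provide, while ``&`` compares element by element:

```python
import pandas as pd

demo = pd.Series([1, 5, 10])

# Python's `and` needs ONE True/False for the whole Series: pandas raises.
try:
    (demo > 2) and (demo < 8)
    and_failed = False
except ValueError as exc:
    and_failed = True
    print("`and` failed:", exc)

# `&` works element-wise and returns a boolean Series.
combined = (demo > 2) & (demo < 8)
print(combined)
```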

Let's see them in action using the *OR* operator (``|``). We will compute the expression that selects the companies that have revenue greater than ``$150,000M`` **OR** less than ``$80,000M``. Graphically, we want to select the following companies:

Let's treat each expression separately; first, let's focus on those companies with revenue greater than $150,000M:

```py
>>> s > 150_000
```
(see the result in the notebook)

Then, those companies with revenue less than $80,000M:

```py
>>> s < 80_000
```
(see the result in the notebook)

Now, let's put it all together, using the ``|`` (OR) operator. IMPORTANT! When we combine comparison expressions using boolean operators, we must surround each expression in parentheses:

```py
>>> (s > 150_000) | (s < 80_000)
Apple                 True
Samsung               True
Alphabet              True
Foxconn               True
Microsoft            False
Huawei               False
Dell Technologies    False
Meta                 False
Sony                 False
Hitachi              False
Intel                 True
IBM                   True
Tencent               True
Panasonic             True
```

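You can see that the ``True`` values match the companies highlighted in the image above: Apple, Samsung, Alphabet and Foxconn (above $150,000M), plus Intel, IBM, Tencent and Panasonic (below $80,000M).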
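The parentheses matter because ``|`` and ``&`` bind more tightly than ``>`` and ``<`` in Python. The sketch below, on a made-up series, shows what happens without them: Python first evaluates ``150_000 | demo`` and then builds a chained comparison, which fails. We catch both ``ValueError`` and ``TypeError`` since the exact error can depend on the data type:

```python
import pandas as pd

demo = pd.Series([200_000, 100_000, 50_000], index=["A", "B", "C"])

# With parentheses: each comparison is evaluated first, then combined.
good = (demo > 150_000) | (demo < 80_000)
print(good)

# Without parentheses this is parsed as: demo > (150_000 | demo) < 80_000
try:
    demo > 150_000 | demo < 80_000
    unparenthesized_failed = False
except (ValueError, TypeError) as exc:
    unparenthesized_failed = True
    print("Without parentheses:", exc)
```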

#### Companies with revenue greater than $150,000M or less than $80,000M

##### Revenue greater than $150,000M

In [None]:
s > 150_000

##### Revenue less than $80,000M

In [None]:
s < 80_000

##### Putting it all together

In [None]:
(s > 150_000) | (s < 80_000)

##### Selecting the companies matching the expression:

In [None]:
s.loc[(s  > 150_000)|(s < 80_000)]

##### The NOT (~) operator

In [None]:
s.loc[s >= 150_000]

In [None]:
s.loc[~ (s >= 150_000 )]
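The ``~`` operator flips every value of a boolean series, so negating ``>=`` gives the same selection as ``<``. A short sketch with made-up values:

```python
import pandas as pd

demo = pd.Series([200_000, 150_000, 90_000], index=["A", "B", "C"])

mask = demo >= 150_000
inverted = ~mask            # True becomes False and vice versa

print(demo.loc[inverted])
```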

4. **Select the companies with the MOST and LEAST revenue**

In [None]:
s.loc[ (s == s.max()) | (s == s.min()) ]

5. **Select companies with revenue between ``$80,000M`` and ``$150,000M``**

In [None]:
s.loc[(s < 150_000) & (s > 80_000)]
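For range checks like this one, pandas also offers ``Series.between``. The sketch below assumes pandas 1.3 or newer, where the ``inclusive`` argument accepts the strings ``"both"``, ``"neither"``, ``"left"`` and ``"right"``; the data is made up:

```python
import pandas as pd

demo = pd.Series([274515, 143015, 80000, 63191], index=["A", "B", "C", "D"])

# Manual version, with both bounds excluded (as in the solution above).
manual = demo.loc[(demo < 150_000) & (demo > 80_000)]

# Same selection with between(); "neither" excludes both bounds.
with_between = demo.loc[demo.between(80_000, 150_000, inclusive="neither")]

print(with_between)
```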

# Practicing Series Filtering with S&P500 and Census Data

In this project you'll practice your Series filtering skills.

Before we get started, let's introduce the datasets used. Make sure your lab is running!

## Datasets

Both datasets used for this project were taken from the publicly available and Open Source RDatasets repository.

### Age of First Marriage

The first one is titled **Age at first marriage of 5,534 US women** ([source](https://vincentarelbundock.github.io/Rdatasets/doc/openintro/age_at_mar.html)). It reads:

> **Age at first marriage of 5,534 US women who responded to the National Survey of Family Growth (NSFG) conducted by the CDC in the 2006 and 2010 cycle.**

There are a total of 5,534 observations.

### S&P500 Returns (1990's)

The second one is titled **Returns of the Standard and Poors 500** ([source](https://vincentarelbundock.github.io/Rdatasets/doc/MASS/SP500.html)) and contains daily returns for the S&P500 in the 1990's (1991-1999). It contains 2,780 values.

## Reading the data

We can use the pandas built-in ``read_csv`` function to read the data that is stored in CSV format. Most commonly, ``read_csv`` is used to read data into DataFrames, but as this project deals with ``Series``, we call ``.squeeze('columns')`` on the result to turn the single-column DataFrame into a series. Bottom line is: don't worry about it for now, both datasets should be available for you in the variables ``age_marriage`` and ``sp500``.
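As a rough sketch of what happens when reading a single-column CSV into a Series, the example below uses an in-memory string instead of a file. The column names and values are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV content: an id column plus one data column.
csv_text = "id,age\n1,32\n2,25\n3,27\n"

frame = pd.read_csv(io.StringIO(csv_text), index_col=0)  # a one-column DataFrame
series = frame.squeeze("columns")                        # collapse it to a Series

print(type(frame).__name__, "->", type(series).__name__)
print(series)
```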

We can also display a quick histogram about our data to understand how it is distributed. This is completely optional.

In [None]:
import pandas as pd
# for visualizations, don't worry about these for now
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

In [None]:
age_marriage = pd.read_csv("age_at_mar.csv", index_col=0).squeeze('columns')
age_marriage.head() 

In [None]:
age_marriage.shape

In [None]:
fig, ax = plt.subplots(figsize=(14,7))
sns.histplot(age_marriage, ax=ax);

### S&P Returns 1990's

In [None]:
sp500 = pd.read_csv('SP500.csv', index_col=0).squeeze('columns')
sp500.head()

In [None]:
sp500.shape

In [None]:
fig, ax = plt.subplots(figsize=(14,7))
sns.histplot(sp500, ax=ax);

## Activities

1. Rename the series accordingly

    Rename both series with the names specified below, given their variables:

    - ``age_marriage``: should be named "Age of First Marriage"
    - ``sp500``: should be named "S&P500 Returns 90s"

In [None]:
age_marriage.name = "Age of First Marriage"
sp500.name = "S&P500 Returns 90s"

2. What's the maximum Age of marriage?

In [None]:
age_marriage.max()

3. What's the median Age of Marriage?

In [None]:
age_marriage.median()

4. What's the minimum return from S&P500?

    Enter the value with up to 2 decimals of precision. For example, if the value is ``-11.8718``, enter only ``-11.87``.

In [None]:
sp500.min()

5. How many Women marry at age 21?

    21 is the most common age for marriage (you can check that using the .mode() method). How many women married at that age?

In [None]:
len(age_marriage.loc[age_marriage == age_marriage.mode().iloc[0]])
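The same count can be cross-checked with ``value_counts()``, which tallies every distinct value and sorts the tallies in descending order. A small sketch on made-up ages:

```python
import pandas as pd

ages = pd.Series([21, 23, 21, 30, 21, 23])

most_common = ages.mode().iloc[0]                  # most frequent value
count_mask = int((ages == most_common).sum())      # count via a boolean mask
count_vc = int(ages.value_counts().iloc[0])        # largest tally comes first

print(most_common, count_mask, count_vc)
```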

6. How many Women marry at 39y/o or older?

In [None]:
len(age_marriage.loc[age_marriage >= 39])

7. How many positive S&P500 returns are there?

    That is, a return greater than 0.

In [None]:
len(sp500.loc[sp500 > 0])

In [None]:
ax = sns.histplot(sp500)
ax.axvline(0, color='red')

8. How many returns are less than or equal to -2?

In [None]:
len(sp500.loc[sp500 <= -2])

In [None]:
ax = sns.histplot(sp500)
ax.axvline(-2, color='red')

### Advanced Selection with Boolean Operators

Now it's time to combine conditionals using boolean operators to create more advanced filters. This time, we'll ask you to define new variables that will be checked dynamically.

9. Select all women below 20 or above 39

    Perform a selection of all the values in ``age_marriage`` that are below ``20`` or above ``39``. Store your results in the variable ``age_20_39``.

In [None]:
fig, ax = plt.subplots(figsize=(14, 7))
sns.histplot(age_marriage, ax=ax)
ax.add_patch(Rectangle((10, 0), 9, 450, alpha=.3, color='red'))
ax.add_patch(Rectangle((39, 0), 5, 450, alpha=.3, color='red'))

In [None]:
age_20_39 = age_marriage.loc[(age_marriage < 20) | (age_marriage > 39)]

10. Select all women whose ages are even, and are older than 30 y/o

    Perform a selection of all the values that are greater than ``30`` and even. Store your result in the variable ``age_30_even``.

In [None]:
# Even ages leave a remainder of 0 when divided by 2
age_30_even = age_marriage.loc[(age_marriage % 2 == 0) & (age_marriage > 30)]
age_30_even.head()
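A minimal sketch of combining a modulo test with a comparison, on made-up ages. Comparing the remainder to ``0`` produces a proper boolean series, which can then be joined with ``&``:

```python
import pandas as pd

ages = pd.Series([28, 31, 32, 35, 40])

is_even = ages % 2 == 0     # remainder 0 means the age is even
older = ages > 30

result = ages[is_even & older]
print(result)
```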

11. Select the S&P500 returns between 1.5 and 3

    The ones depicted below:

In [None]:
fig, ax = plt.subplots(figsize=(14, 7))
sns.histplot(sp500, ax=ax)
ax.add_patch(Rectangle((1.5, 0), 1.5, 250, alpha=.3, color='red'))

In [None]:
sp_15_to_3 = sp500.loc[(sp500 > 1.5) & (sp500 < 3)]

In [None]:
sp_15_to_3.head()

# Vectorized Operations with Series

## Introduction

In this brief project, we'll learn about "Vectorized Operations". In particular, we'll learn about Vectorized Operations applied to Pandas Series; but in reality, they're a concept that originally comes from NumPy, and we'll use it A LOT with DataFrames.

So, the examples we'll see here might look trivial, but trust us that they'll be very useful throughout all your Pandas journey.

Let's get started!

## Understanding Vectorized Operations

Vectorized Operations means just applying a "global" function to an entire Series. Let's derive an example from a Spreadsheet, in which we create a new column by applying an operation to ANOTHER column:

With Series, it's going to be pretty much the same, it might look even simpler. Start the lab if you haven't already and take a look at the first operations.

First, we initialize the Series we've been using, this time we name it ``revenue_in_millions``. That's the same series we've used so far, and it captures the revenue of the companies (listed in the Index) in Millions of dollars.

We then create **A NEW** Series named ``revenue_in_billions`` just by dividing the whole Series by ``1000``:


```py
>>> revenue_in_billions = revenue_in_millions / 1000

Apple                274.515
Samsung              200.734
Alphabet             182.527
Foxconn              181.945
Microsoft            143.015
Huawei               129.184
Dell Technologies     92.224
Meta                  85.965
Sony                  84.893
Hitachi               82.345
Intel                 77.867
IBM                   73.620
Tencent               69.864
Panasonic             63.191
```

That's it! That's a vectorized operation. We say it's "vectorized" because it doesn't act on just 1 value, but on the whole *vector* of values contained in the Series.
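To make the idea concrete, the sketch below compares the vectorized division with an explicit Python loop over the same values. It uses two companies from the lab's data; the loop is only there for comparison, not as a recommended approach:

```python
import pandas as pd

rev = pd.Series([274515, 200734], index=["Apple", "Samsung"])

# Vectorized: one expression applied to every element at once.
vectorized = rev / 1000

# Equivalent loop: visit each value and rebuild the Series by hand.
looped = pd.Series([value / 1000 for value in rev], index=rev.index)

print(vectorized)
```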

### Available Operators

For now, we'll mostly focus on the regular arithmetic operators: ``+``, ``-``, ``*``, ``/``, ``**``, etc. But you'll see in further labs that we can create vectorized operations with String operations or even our own custom functions.

### Practice time

Now it's your turn to practice some vectorized operations before we advance to the following section.

In [None]:
import pandas as pd

companies = [
    'Apple', 'Samsung', 'Alphabet', 'Foxconn',
    'Microsoft', 'Huawei', 'Dell Technologies',
    'Meta', 'Sony', 'Hitachi', 'Intel',
    'IBM', 'Tencent', 'Panasonic'
]

revenue_in_millions = pd.Series([
    274515, 200734, 182527, 181945, 143015,
    129184, 92224, 85965, 84893, 82345,
    77867, 73620, 69864, 63191],
    index=companies,
    name="Top Technology Companies by Revenue")

#### Understanding Vectorized Operations

In [None]:
revenue_in_billions = revenue_in_millions / 1000
revenue_in_billions

### Activities

1. Subtract $50B from all companies in ``revenue_in_billions``

    The recession just hit! Let's say you need to subtract *$50B* from all the companies in ``revenue_in_billions``. Store the new series in the variable ``revenue_recession``

In [None]:
revenue_recession = revenue_in_billions - 50

2. Create a new series expressing revenue in dollars (units)

    The accounting team needs more detail when calculating EBITDA. They need revenue expressed in dollar units (instead of millions or billions). Use either series ``revenue_in_millions`` or ``revenue_in_billions`` to create a new series ``revenue_in_dollars``.

In [None]:
revenue_in_dollars = revenue_in_millions * 1000000

### Operations between Series

If we keep the analogy of spreadsheets, you'll see that it's also possible to create operations between different Series. For example, let's say recession hits again and the revenue of all companies is affected; but not equally. We want to reduce the revenue of each company by a given percentage. For example, Apple's new revenue will be 91% of the original one, Samsung's 93%, etc.

Expressing that with Series, we first create the series ``recession_impact``, and apply the operation directly on the ``revenue_in_millions`` original series:

```py
>>> revenue_in_millions * recession_impact
Apple                249808.65
Samsung              186682.62
Alphabet             178876.46
Foxconn              176486.65
Microsoft            141584.85
Huawei               114973.76
Dell Technologies     80234.88
Meta                  70491.30
Sony                  78950.49
Hitachi               76580.85
Intel                 69301.63
IBM                   71411.40
Tencent               67768.08
Panasonic             59399.54
```

Now it's your turn to practice operations between Series.

In [None]:
recession_impact = pd.Series([
    0.91, 0.93, 0.98, 0.97, 0.99, 0.89, 0.87,
    0.82, 0.93, 0.93, 0.89, 0.97, 0.97, 0.94], index=companies)
recession_impact

The result of applying the recession impact:

In [None]:
revenue_in_millions * recession_impact

We can calculate the dollar amount of the impact by combining multiple operations:

In [None]:
# Absolute impact in Millions
revenue_in_millions - (revenue_in_millions * recession_impact)

In [None]:
# Absolute impact in Billions
(revenue_in_millions - (revenue_in_millions * recession_impact)) / 1_000
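One detail worth keeping in mind: operations between two Series match elements by index label, not by position. The sketch below uses made-up numbers, with the second series in a different order and missing one company:

```python
import pandas as pd

revenue = pd.Series([100, 200, 300], index=["Apple", "Samsung", "Sony"])
factor = pd.Series([0.5, 0.25], index=["Samsung", "Apple"])  # no entry for Sony

result = revenue * factor
print(result)
```

Apple is multiplied by its own factor even though the order differs, and Sony ends up as ``NaN`` because there is no matching label in ``factor``. In the lab this never shows up because both series share the same ``companies`` index.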

3. Calculate revenue per employee, in dollars

    Using the series ``number_of_employees`` (given in the notebook), your job is to calculate revenue ``per`` employee, expressed in dollars (units). Store it in the variable ``revenue_per_employee``.

In [None]:
number_of_employees = pd.Series([
    164000, 266673, 150028, 1290000, 221000, 195000,
    165000, 71970, 109700, 368250, 121100, 282100, 112771, 240198
], index=companies)

In [None]:
revenue_per_employee = revenue_in_dollars / number_of_employees

# Practicing Series Vectorized Operations with Penguins Data

## Introduction

Now it's time to put your knowledge of vectorized operations on Pandas series to the test. In this lab, we will be working with a dataset that contains information about penguins. Each penguin is described by various attributes such as species, island, culmen length, culmen depth, flipper length, body mass, and gender.

In this lab we will practice vectorized operations on Pandas series. We will learn how to perform arithmetic operations on series and apply mathematical functions to series.

Throughout the lab, you will be presented with coding activities that require you to write code snippets to perform specific operations on the series. Each activity will be followed by a solution, allowing you to verify your code and understand the correct approach.

In addition to individual topic-based activities, there will also be mixed-topic activities that require you to combine different operations to achieve a specific outcome. These activities will test your ability to apply multiple concepts simultaneously.

By the end of this lab, you will have gained a solid understanding of vectorized operations on Pandas series and be able to manipulate and analyze data efficiently using these techniques.

Let's dive into the activities and explore the power of vectorized operations on Pandas series!

## Activities

In [None]:
import pandas as pd

In [None]:
# Read the dataset into a DataFrame
df = pd.read_csv('penguins_cleaned.csv')
df

In [None]:
# Convert all columns to pandas Series
species = df['species']
island = df['island']
culmen_length_mm = df['culmen_length_mm']
culmen_depth_mm = df['culmen_depth_mm']
flipper_length_mm = df['flipper_length_mm']
body_mass_g = df['body_mass_g']
gender = df['sex']

1. **Add a constant value of 100 to the ``body_mass_g`` series**

    Create a new series called ``body_mass_g_plus_100`` by adding a constant value of 100 to the ``body_mass_g`` series.

In [None]:
body_mass_g_plus_100 = body_mass_g + 100
body_mass_g_plus_100

2. **Subtract the ``culmen_length_mm`` series from the ``flipper_length_mm`` series**

    Subtract the ``culmen_length_mm`` series from the ``flipper_length_mm`` series and assign the result to a new series called ``length_difference``.

In [None]:
length_difference = flipper_length_mm - culmen_length_mm
length_difference

3. **Multiply the ``culmen_depth_mm`` series by 2**

    Multiply the ``culmen_depth_mm`` series by 2 and assign the result to a new series called ``double_culmen_depth_mm``.

In [None]:
double_culmen_depth_mm = culmen_depth_mm * 2

4. **Raise the ``flipper_length_mm`` series to the power of 2**
    
    Create a new series called ``flipper_length_mm_squared`` by raising the ``flipper_length_mm`` series to the power of 2.

In [None]:
flipper_length_mm_squared = flipper_length_mm ** 2
flipper_length_mm_squared

5. **Calculate the mean of the ``culmen_length_mm`` series and subtract it from each value in the series**

    Find the mean of the ``culmen_length_mm`` series and subtract it from each value in the series. Assign the result to a new series called ``culmen_length_mm_mean_centered``.

In [None]:
culmen_length_mm_mean_centered = culmen_length_mm - culmen_length_mm.mean()
culmen_length_mm_mean_centered

6. **Concatenate the ``species`` and ``gender`` series, separated by a hyphen ``-``**

    Create a new series called ``species_and_gender`` by concatenating the ``species`` and ``gender`` series, separated by a hyphen (``-``).

In [None]:
species_and_gender = species.str.cat(gender, sep='-')
species_and_gender
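
The same result can also be obtained with the ``+`` operator, since string Series support element-wise concatenation. The sketch below uses two short, made-up Series rather than the lab's data.

```python
import pandas as pd

# Hypothetical values for illustration only
species = pd.Series(['Adelie', 'Gentoo'])
gender = pd.Series(['MALE', 'FEMALE'])

# Two equivalent ways of joining the strings element by element
with_cat = species.str.cat(gender, sep='-')
with_plus = species + '-' + gender

print(with_plus.tolist())
```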

7. **Perform element-wise addition of ``culmen_length_mm`` and ``culmen_depth_mm``**

    Add ``culmen_length_mm`` and ``culmen_depth_mm`` together and assign the result to a new variable called ``culmen_length_plus_depth_mm``.

In [None]:
culmen_length_plus_depth_mm = culmen_depth_mm + culmen_length_mm
culmen_length_plus_depth_mm

8. **Sort ``culmen_length_mm`` in descending order**

    Create a new series called ``culmen_length_mm_sorted`` by sorting ``culmen_length_mm`` in descending order.

In [None]:
culmen_length_mm_sorted = culmen_length_mm.sort_values(ascending=False)
culmen_length_mm_sorted

9. **Divide ``flipper_length_mm`` by ``culmen_length_mm``**

    Find the ratio of each penguin's flipper length to its culmen length and assign the result to a new variable called ``length_ratio``.

In [None]:
length_ratio = flipper_length_mm / culmen_length_mm
length_ratio

# Series Practice: Vectorized operations using NBA data

It's time to put our Vectorized Operations with Series into practice.

The data we'll use is related to statistics of Players from the NBA since the year 1985. Although this practice is about Series, we'll start reading our data as a DataFrame. If you don't know what a DataFrame is yet, don't worry... this will actually be useful for the future. The only thing you need to know now is that each column of a DataFrame is a Series. And we're extracting several Series from the df:

```py
# Game info
games_played = df['G']
minutes_played = df['MP']

# Field Goals info
field_goals = df['FG']
field_goals_attempts = df['FGA']

# Free Throws info
free_throws = df['FT']
free_throws_attempts = df['FTA']
```

The index of the Series is the Player's name. So, for example, we can find the total field goals of Michael Jordan:

```py
field_goals.loc[field_goals.index == 'Michael Jordan*']
```

> ***The star (*) next to a player's name indicates that the player was inducted into the NBA "Hall of Fame".***

Now, let's get started with our practice!

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('nba_player_stats_1985.csv', index_col='Player')

In [None]:
df.head()

In [None]:
# Game info
games_played = df['G']
minutes_played = df['MP']

# Field Goals info
field_goals = df['FG']
field_goals_attempts = df['FGA']

# Free Throws info
free_throws = df['FT']
free_throws_attempts = df['FTA']

In [None]:
games_played.head()

In [None]:
field_goals.head()

In [None]:
field_goals_attempts.head()

Michael Jordan Field Goals:

In [None]:
field_goals.loc['Michael Jordan*']

## Arithmetic Operations

1. **Calculate field goal accuracy**

    Calculate the "Field Goal accuracy" of a player by dividing their field goals by their total attempts then multiply by 100. Store the result in the variable ``field_goal_perc``.

In [None]:
field_goal_perc = field_goals / field_goals_attempts * 100

2. **What's the FG% of Michael Jordan?**

    Use the series created in the previous activity, ``field_goal_perc``, to answer: what's the FG% of ``Michael Jordan``?

    > *Remember, MJ's name in this dataset is ``Michael Jordan*`` because he was (obviously) inducted into the HoF.*

    Enter your result with up to three decimal places (truncate, don't round). That is, if the value is ``0.618324``, enter ``0.618`` (including the ``0`` and the dot ``.``).

In [None]:
field_goal_perc.loc['Michael Jordan*']
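
Truncating is not the same as rounding. A minimal sketch of the difference, using a made-up number (not Michael Jordan's actual FG%):

```python
import math

value = 61.8329  # hypothetical value for illustration

# Shift three decimals to the left of the point, drop the rest, shift back
truncated = math.trunc(value * 1000) / 1000

print(truncated)        # truncation keeps 61.832
print(round(value, 3))  # rounding would give 61.833
```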

3. **Field goals per Game**

    Calculate "Field Goals per Game" using the series ``field_goals`` and ``games_played``. Store your results in the variable ``field_goals_per_game``.

In [None]:
games_played.head()

In [None]:
field_goals_per_game = field_goals / games_played 

In [None]:
field_goals_per_game.head()

4. **Which player has the highest 'Field Goal per Game' value?**

    All stars here...

In [None]:
field_goals_per_game.sort_values(ascending=False).head()

5. **Calculate 'Total Points'**

    In NBA lingo, field goals account for all the "goals" scored by a player, **EXCEPT** free throws. So, if we want to calculate the total number of points scored by a player, we must combine field goals and free throws. Field goals are a mix of 2-point and 3-point goals, but for this exercise you can safely assume that every field goal is worth 2 points. Each free throw is worth 1 point.

    Calculate the Total Points scored by each player by adding twice the field goals to the free throws. Store your results in the variable ``total_points``.

In [None]:
field_goals.head()

In [None]:
free_throws.head()

In [None]:
total_points = (field_goals * 2) + free_throws
total_points.head()

6. **Who's the player with the most Total Points?**

    Who's the player that, according to our dataset, has scored the most points in NBA history?

In [None]:
total_points.sort_values(ascending=False).head()

7. **Total Points per Minute**

    Using the series that you previously calculated, ``total_points``, calculate "Total points per minute". Store your results in the variable ``points_per_minute``.

    > *Important. This activity relies on ``total_points``. Make sure you have completed that one correctly.*

In [None]:
points_per_minute = total_points / minutes_played

8. **Who has a better Points per Minute score; MJ or Kevin Durant?**

In [None]:
idxmax, valmax = points_per_minute.agg(['idxmax', 'max'])
print(idxmax, valmax)

In [None]:
points_per_minute.loc['Michael Jordan*'] > points_per_minute.loc['Kevin Durant']

9. **Calculate FT%**

    FT% is the proportion of Free Throws scored out of the total attempts. Basically, the accuracy of Free Throws. Store your results in ``ft_perc``.

In [None]:
ft_perc = free_throws / free_throws_attempts
ft_perc
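
One thing to keep in mind with vectorized division is what happens when the denominator is zero, for example a player with no attempts. The sketch below uses invented players and numbers only to show how pandas handles each case; the third case cannot occur in real free throw data and is there for completeness.

```python
import numpy as np
import pandas as pd

# Hypothetical players (names and numbers are made up)
made = pd.Series([8, 0, 3], index=['A', 'B', 'C'])
attempts = pd.Series([10, 0, 0], index=['A', 'B', 'C'])

ratio = made / attempts

# A: regular division, B: 0 / 0 becomes NaN, C: 3 / 0 becomes inf
print(ratio)
```

No exception is raised; missing or infinite values simply appear in the result, and NaN values are skipped by methods such as ``quantile()`` and ``mean()``.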

10. **Who's the player with the best FT% record: MJ or Larry Bird?**

    A battle of titans. Who had a better FT% record?

In [None]:
ft_perc.loc['Michael Jordan*'] > ft_perc.loc['Larry Bird*']

## Boolean Operations

11. **Find the top 25% players by 'free throw accuracy'**

    Create a boolean series that contains ``True`` values for those players that are in the top 25% by free throw efficiency (using the previously calculated ``ft_perc`` series). Store your results in the variable ``ft_top_25``.

In [None]:
ft_perc.quantile(.75)

In [None]:
ft_perc.head()

In [None]:
ft_top_25 = ft_perc >= ft_perc.quantile(.75)
ft_top_25.head(10)

In [None]:
# Alternative: blank out values below the 75th percentile, then compare with the original
ft_top_25 = ft_perc.where(ft_perc >= ft_perc.quantile(.75)) == ft_perc
ft_top_25.head(10)

12. **How many players are in the top 25% by free throw accuracy?**

    Answer using the previously calculated series ``ft_top_25``.

In [None]:
ft_top_25.value_counts()

In [None]:
ft_top_25.sum()
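
Both cells above answer the question. ``sum()`` works because ``True`` counts as 1 and ``False`` as 0 when a boolean Series is added up. A minimal sketch with a made-up boolean Series:

```python
import pandas as pd

flags = pd.Series([True, False, True, True])  # hypothetical boolean Series

count_true = flags.sum()   # number of True values
share_true = flags.mean()  # fraction of True values

print(count_true, share_true)
print(flags.value_counts())  # counts of True and False separately
```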

13. **Find those players that scored 0 points in their history**

    Create a boolean series that contains ``True`` values for those players that have scored 0 total points. Store your results in the variable ``players_0_points``.

In [None]:
players_0_points = total_points == 0
players_0_points.sum()

# Exploring DataFrames with Currency Data

## Introduction

Welcome to the project on exploring currency data using pandas DataFrame! In this project, we will embark on an exciting journey of analyzing and gaining insights from a comprehensive dataset containing information about different currencies from around the world.

Throughout this project, we will focus on practicing fundamental techniques on pandas DataFrames to extract meaningful information and perform various analyses. We will work with a dataset that includes currency names, symbols, currency codes, countries of usage, and other relevant details.

Our exploration will cover several essential topics. Firstly, we will master basic navigation techniques using methods like ``head()``, ``tail()``, ``info()``, ``shape``, ``describe()``, ``nunique()``, and ``isnull()``. These methods will allow us to quickly inspect the dataset, understand its structure, and identify any missing values.

Next, we will learn different methods for selecting specific columns from the DataFrame. Using indexing with ``[]``, ``loc[]``, and ``iloc[]``, we will extract and analyze specific attributes of the currencies. This selective approach will enable us to focus on the aspects that interest us the most and perform targeted analyses.

Lastly, we will delve into slicing techniques for basic filtering of the DataFrame. Slicing will allow us to extract subsets of data and perform more detailed and granular analyses. By applying these slicing techniques, we can gain insights into specific groups of currencies or examine currencies based on their properties.

Through this project, you will gain hands-on experience with pandas DataFrames and develop a solid foundation in data exploration, manipulation, and analysis. By working with a real-world currency dataset, you will enhance your skills in working with diverse datasets and be well-prepared for future data science projects.

Get ready to embark on an exciting journey of exploring currency data with pandas DataFrame. Let's dive in and unlock the insights hidden within the world of currencies!

In [None]:
import pandas as pd
df = pd.read_csv('currencies.csv')
df

## Activities

1. Getting an Overview of the dataset

In [None]:
df.head(4)

2. Select the correct answers for dataset information

In [None]:
df['Digits'].dtype

In [None]:
df['Symbol'].dtype

3. Get the statistics of the dataset

In [None]:
df['Digits'].std()

In [None]:
df['Digits'].mean()

4. Count the number of unique currencies in the dataset

In [None]:
df.size

In [None]:
pd.unique(df['Code']).size

In [None]:
df.nunique()

5. Identify the number of missing values in each column

    Write the count of all missing values in the dataset.

In [None]:
df.isna().sum()

6. Determine the highest currency number in the dataset

    Write answer in the form of a float. For example, if the answer is 98, write 98.0.

In [None]:
df['Number'].max()

7. Select Currency Names

    Extract the ``'Name'`` column from the DataFrame df and assign it to a new variable ``names``.

In [None]:
names = df['Name']

8. Get the details of the 3rd row

    Extract the 3rd row from the DataFrame ``df`` and assign it to a new variable ``row_3``.

In [None]:
row_3 = df.iloc[2]

9. Select rows 10 to 15 (inclusive) from the DataFrame

    Select rows 10 to 15 (inclusive) from the DataFrame ``df`` and assign it to a new variable ``rows``.

    > *You were asked for rows, not index numbers.*

In [None]:
rows = df.iloc[9:15]
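
The slice ``9:15`` can look surprising at first. A small sketch on a made-up 20-row frame shows why it covers rows 10 to 15:

```python
import pandas as pd

# Hypothetical 20-row frame; "row_number" counts rows starting from 1
toy = pd.DataFrame({'row_number': range(1, 21)})

# Row n lives at position n - 1, and the stop position is excluded,
# so rows 10..15 are positions 9..14, written as 9:15
selected = toy.iloc[9:15]

print(selected['row_number'].tolist())
```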

10. Extract Alternating Rows from DataFrame

    Your task is to extract the alternating rows from the DataFrame df, where "alternating" refers to the 1st, 3rd, 5th, and so on (not based on index order). Store these selected rows in a new variable named ``rows_every_other``.

In [None]:
rows_every_other = df.iloc[::2]

In [None]:
rows_every_other = df.iloc[range(0, len(df), 2)]

11. Select columns with indices 2, 4, and 5 from the DataFrame

    Select columns with indices 2, 4, and 5 from the DataFrame ``df`` and assign it to a new variable ``cols``.

In [None]:
cols = df.iloc[:,[2,4,5]]

12. Select the first three columns of the dataframe

    Select the first three columns of the dataframe ``df`` and assign it to a new variable ``cols_first_three``.

    > *For this activity, the index column counts as one of the first three columns.*

In [None]:
# The index counts as the first column here, so two data columns are selected
cols_first_three = df.iloc[:,0:2]

# Intro to Pandas DataFrames

## Introduction

### What is a data frame?

Simply put, DataFrames are pandas' way of representing data in tabular form.

pandas is Python's powerful library for data analysis, and its DataFrames give us an efficient way to work with large amounts of structured data.

Is it similar to Excel?

Just as Excel uses spreadsheets, Python uses pandas DataFrames.

> **Python is often preferred over Excel due to its scalability and speed.**

#### How is a DataFrame composed?

So what does a DataFrame look like? It is made up of rows and columns making it two dimensional. Let's take a look below:

We see that our Index is made up of the companies, however an index can also be made up of numbers.

> **An ``Index`` is like an address, it can be used to locate both rows and columns.**

More on that later.

**So, let's jump into our Jupyter notebook and create the DataFrame from this example.**

The dataset for today's lab contains information on the Top Tech Companies in the World as shown below:

Our DataFrame contains five columns:

- ``Revenue``
- ``Employees``
- ``Sector``
- ``Founding Date``
- ``Country``

The syntax to create a dataframe is:

```py
import pandas as pd
pd.DataFrame(data, index)
```

- **data**: These are the values from our dataset.
- **index**: This is like the address of the data we are storing.
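
As a minimal sketch of this syntax, the snippet below builds a three-row DataFrame from a few of the values in the lab's table. The variable name ``small_df`` is only for illustration.

```python
import pandas as pd

# data: one key per column, each holding that column's values
data = {'Revenue': [274515, 200734, 182527],
        'Country': ['USA', 'South Korea', 'USA']}

# index: one label per row, in the same order as the values
index = ['Apple', 'Samsung', 'Alphabet']

small_df = pd.DataFrame(data, index=index)
print(small_df)
```

Each list in ``data`` must have the same length as ``index``, otherwise pandas raises an error.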

## Basic Navigation & Browsing Techniques

So, how do we truly harness the power of pandas DataFrames?

Let's explore some functions:

### head

The ``head()`` method displays the first few rows of your DataFrame, making it easier to get a sense of the overall structure and content. It will display the first five rows by default.

```py
df.head()
```

We can display more rows by passing the number of rows we want to ``.head(n)``, as shown below:

```py
df.head(10)
```

This function is especially useful as you can quickly inspect your DataFrame using the ``head()`` method to ensure that all of the data is stored correctly and as expected.

### tail

The ``tail()`` method is similar to ``head()`` except it displays the last few rows of your DataFrame.

```py
df.tail()
```
As with ``head()``, we can specify the number of rows we want to display with ``.tail(n)``.

```py
df.tail(10)
```

This is useful for quickly checking the end of your dataset, where problems such as incomplete rows would not show up in ``head()``.

### info

The ``info()`` method prints a summary of your DataFrame: the index range, each column's name, non-null count and data type, and the total memory usage.

```py
df.info()
```

This makes it easy to see how much memory the DataFrame uses and can help identify potential problems such as missing values or incorrect data types.
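
If a per-column breakdown of memory is needed, ``memory_usage()`` can be used alongside ``info()``. A minimal sketch on a made-up two-column frame:

```python
import pandas as pd

# Hypothetical frame for illustration
toy = pd.DataFrame({'n': [1, 2, 3], 'label': ['a', 'b', 'c']})

# Bytes used by the index and by each column;
# deep=True also measures the strings stored in object columns
per_column = toy.memory_usage(deep=True)

print(per_column)
```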

### shape

The ``shape`` attribute returns a tuple with the number of rows and columns ``(rows, columns)`` in our DataFrame. Because it is an attribute and not a method, it is used without parentheses.

```py
df.shape
```

This gives us a quick insight into the dimensionality of our DataFrame.

### describe

The ``describe()`` method displays descriptive statistics for the numerical columns in your DataFrame: the count, mean, standard deviation, minimum, quartiles (the ``50%`` row is the median), and maximum.

```py
df.describe()
```

This can be very useful for understanding the distribution of values across a specific dataset or column without having to manually calculate each statistic.
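
The output of ``describe()`` is itself a DataFrame, so individual statistics can be looked up by label. The sketch below uses a made-up column to show that the ``50%`` row matches the median:

```python
import pandas as pd

# Hypothetical values with one large outlier
toy = pd.DataFrame({'value': [1, 2, 3, 4, 100]})

summary = toy.describe()
print(summary)

# The '50%' row is the median; the mean is pulled up by the outlier
median_from_describe = summary.loc['50%', 'value']
mean_from_describe = summary.loc['mean', 'value']
print(median_from_describe, mean_from_describe)
```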

### nunique

The ``nunique()`` method counts the number of distinct elements in each column.

```py
df.nunique()
```

This can be very useful for understanding the number of categories we have in a column for example.

### isnull

The ``isnull()`` method detects missing values by creating a DataFrame of booleans, with ``True`` where a value is missing (``NaN`` or ``None``) and ``False`` otherwise.

```py
df.isnull()
```

We can take this one step further by applying the ``sum()`` function to get the number of missing values in each column of our DataFrame.

```py
df.isnull().sum()
```
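
To see the difference between the per-column counts and a single grand total, here is a minimal sketch on a made-up frame with three missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing value in 'a', two in 'b'
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, None]})

per_column = toy.isnull().sum()   # one count per column
total = toy.isnull().sum().sum()  # a single number for the whole frame

print(per_column)
print(total)
```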

## Column Selection

Before we begin this section, it is important that we can differentiate between a ``Series`` and a ``DataFrame``.

- A ``Series`` is a single column in a DataFrame.
- A ``DataFrame`` is an entire table of data.

Let's take a look at how we can choose these items:

### Select One Column

The ``[]`` operator can be used to select a specific column within a DataFrame. The output is a ``Series``.

```py
df['column_name']
```

### Select One Column and Apply Methods

DataFrames also allow us to apply methods to columns. These methods include ``sum()``, ``mean()``, ``min()``, ``max()``, ``median()`` and more.

```py
df['column_name'].sum()
```

### Select Multiple Columns

To select multiple columns, use the ``[]`` operator with a ``list`` of column names as the argument. This creates another ``DataFrame``.

```py
df[['column_name_1', 'column_name_2','column_name_3']]
```

We can save this new DataFrame under a new variable name so that we can come back to it later.

```py
new_df = df[['column_name_1', 'column_name_2','column_name_3']]
```
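
The number of brackets decides the type of the result, even when only one column is involved. A minimal sketch with two rows from the lab's table:

```python
import pandas as pd

toy = pd.DataFrame({'Revenue': [274515, 200734],
                    'Employees': [147000, 267937]},
                   index=['Apple', 'Samsung'])

one_column = toy['Revenue']       # a column name gives a Series
one_column_df = toy[['Revenue']]  # a one-item list gives a DataFrame

print(type(one_column).__name__)
print(type(one_column_df).__name__)
```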

### Select Multiple Columns and Apply Methods

We can take a shortcut and apply methods on more than one column at the same time.

```py
df[['column_name_1', 'column_name_2','column_name_3']].mean()
```

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame()

In [None]:
# Lists of data
data = {'Revenue': [274515,200734,182527,181945,143015,129184,92224,85965,84893,
                    82345,77867,73620,69864,63191],
        'Employees': [147000,267937,135301,878429,163000,197000,158000,58604,
                      109700,350864,110600,364800,85858,243540],
        'Sector': ['Consumer Electronics','Consumer Electronics','Software Services',
                   'Chip Manufacturing','Software Services','Consumer Electronics',
                   'Consumer Electronics','Software Services','Consumer Electronics',
                   'Consumer Electronics','Chip Manufacturing','Software Services',
                   'Software Services','Consumer Electronics'],
        'Founding Date':['01-04-1976','13-01-1969','04-09-1998','20-02-1974',
                         '04-04-1975','15-09-1987','01-02-1984','04-02-2004',
                         '07-04-1946','01-01-1910','18-07-1968','16-06-1911',
                         '11-11-1998','07-03-1918'],
        'Country':['USA','South Korea','USA','Taiwan','USA','China','USA','USA',
                   'Japan','Japan','USA','USA','China','Japan']} 
index = ['Apple','Samsung','Alphabet','Foxconn','Microsoft','Huawei',
         'Dell Technologies','Meta','Sony','Hitachi','Intel','IBM',
         'Tencent','Panasonic']

In [None]:
df = pd.DataFrame(data, index)

In [None]:
df

1. Output the first four rows of the df using the head function

    Now it's your turn! Try outputting the first four rows of the dataframe using the head function. Store the result in the variable ``head_first_4``.

In [None]:
head_first_4 = df.head(4)

2. Output the last six rows of the df using the tail function.

    Now try outputting the last six rows of the dataframe using the tail function. Store the result in the variable ``tail_last_6``.

In [None]:
tail_last_6 = df.tail(6)

3. Select the column Employees

    Let's practice these skills: select the column ``Employees`` into the variable ``employees_s``. You'll notice that the result of this selection is a ``Series``.

In [None]:
employees_s = df['Employees']

4. Output the median Employees to the nearest whole number

    Now, take it one step further and find the median of the column ``Employees``. Store the result in the variable ``employees_median``.

In [None]:
employees_median = df['Employees'].median()

5. Calculate the mean for columns Revenue and Employees

    Lastly, let's calculate the mean for the columns ``Revenue`` and ``Employees``. Store the result in the variable ``r_e_mean``.

    Your result should be a Series, and it should look something like:

    ```py
    Revenue      XXX
    Employees    YYY
    dtype: float64
    ```

    > *Round off the mean to the nearest whole number.*

In [None]:
r_e_mean = df[['Revenue', 'Employees']].mean()

In [None]:
r_e_mean

## Selection by Index

### Selection by Index - ``loc``

Index selection with ``.loc`` is a pandas DataFrame indexer that allows users to select rows and columns by their **labels** (or with a boolean array).

> ***It is most commonly used when a user needs to access specific elements within a DataFrame, such as selecting all rows with a specific label or values in a specific column.***

```py
df.loc[row_label, column_label]
```

We can use ``:`` in place of ``row_label`` or ``column_label`` to call all the data.

```py
df.loc[:, column_label]
df.loc[row_label,:]
```

We can also pass multiple columns in place of ``column_label`` or multiple rows in place of ``row_label``.

```py
df.loc[['row_name_1', 'row_name_2','row_name_3'], column_label]
df.loc[row_label,['column_name_1', 'column_name_2','column_name_3']]
```

Slicing is a powerful feature of pandas that enables us to access specific parts of our DataFrame.

> ``start:stop:step``

If we don't specify the step, the default value is 1. Unlike regular Python slices, label slices with ``.loc`` include both the start and the stop.

### With columns
```py
df.loc[row_label, column_name_start:column_name_stop]
```
### With rows
```py
df.loc[row_name_start:row_name_stop, column_label]
```
### With step
```py
df.loc[row_label, column_name_start:column_name_stop:n]

df.loc[row_name_start:row_name_stop:n, column_label]
```
### With step and :
```py
df.loc[:, column_name_start:column_name_stop:n]

df.loc[row_name_start:row_name_stop:n, :]
```
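
One detail worth checking by hand is that a label slice with ``.loc`` includes both endpoints. A minimal sketch using the first four companies from the lab's table:

```python
import pandas as pd

toy = pd.DataFrame({'Revenue': [274515, 200734, 182527, 181945]},
                   index=['Apple', 'Samsung', 'Alphabet', 'Foxconn'])

# The stop label 'Alphabet' is part of the result
by_label = toy.loc['Apple':'Alphabet']
print(by_label.index.tolist())

# With a step of 2, every other label in the range is kept
every_other = toy.loc['Apple':'Foxconn':2]
print(every_other.index.tolist())
```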

In [None]:
# Find the revenue for Samsung 

# loc[row_label, column_label]

df.loc['Samsung','Revenue']

In [None]:
# loc[row_label, column_label]

df.loc[:,'Revenue']

In [None]:
# row_label
df.loc['Samsung',:]

In [None]:
# column_label
df.loc[:, 'Revenue']

In [None]:
# Multiple rows
df.loc[['Apple','Samsung','Sony'], 'Revenue']

In [None]:
# Multiple columns
df.loc['Apple', ['Employees','Country']]

In [None]:
rows = ['Apple','Samsung','Sony']
columns = ['Employees','Sector','Country']

In [None]:
# loc[row_label, column_label]

df.loc[rows,columns]

Slicing ``start:stop:step``

With columns

In [None]:
df.loc['Apple', 'Employees':'Founding Date']

With ``rows``

In [None]:
df.loc['Apple':'Sony', 'Employees']

With ``step``

In [None]:
df.loc['Apple':'Sony':2, columns]

With ``step`` and ``:``

In [None]:
df.loc['Apple':'Sony':2, :]

6. Select the Revenue, Employees & Sector for the companies Apple, Alphabet and Microsoft

    Now let's leverage your ``.loc`` selection skills. Your task is to select the columns ``Revenue``, ``Employees`` & ``Sector`` for the companies Apple, Alphabet and Microsoft. Your result should be stored in a variable ``index_selection`` and it should be a DataFrame.

In [None]:
index_selection = df.loc[['Apple','Alphabet','Microsoft'],['Revenue','Employees','Sector']]

## Selection by Position

Selection by position with ``.iloc`` is a useful pandas DataFrame indexer that allows users to select rows and columns of a DataFrame **based on their integer positions**.

> ***This is especially useful when users need to access elements within a DataFrame that do not have labels or specific column names.***

```py
df.iloc[row_position, column_position]
```

We can use ``:`` in place of ``row_position`` or ``column_position`` to call all the data.

```py
df.iloc[:, column_position]

df.iloc[row_position,:]
```

We can also pass multiple columns in place of ``column_position`` or multiple rows in place of ``row_position``.

```py
df.iloc[[row_position_1, row_position_2, row_position_3], column_position]

df.iloc[row_position, [column_position_1, column_position_2, column_position_3]]
```

**Slicing** is a powerful feature of pandas that enables us to access specific parts of our ``DataFrame``.

> ``start:stop:step``

### With columns
```py
df.iloc[row_position, column_position_start:column_position_stop]
```

### With rows
```py
df.iloc[row_position_start:row_position_stop, column_position]
```

### With step
```py
df.iloc[row_position, column_position_start:column_position_stop:n]

df.iloc[row_position_start:row_position_stop:n, column_position]
```

### With step and :
```py
df.iloc[:, column_position_start:column_position_stop:n]

df.iloc[row_position_start:row_position_stop:n, :]
```
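
Position slices with ``.iloc`` follow the usual Python rule: the stop position is excluded. A minimal sketch using the first four companies from the lab's table:

```python
import pandas as pd

toy = pd.DataFrame({'Revenue': [274515, 200734, 182527, 181945]},
                   index=['Apple', 'Samsung', 'Alphabet', 'Foxconn'])

# Positions 0 and 1 only; position 2 is the stop and is left out
first_two = toy.iloc[0:2]
print(first_two.index.tolist())

# Every other row, starting from position 0
every_other = toy.iloc[::2]
print(every_other.index.tolist())
```

This is the main practical difference from ``.loc``, where a label slice keeps its stop label.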

In [None]:
# Find the revenue for Samsung 
df.iloc[1, 0]

Notice if we use ``:`` in place of ``row_position``, it will again return all the data from the specified column.

Thus, we have a ``Series``

In [None]:
df.iloc[:,0]

Let's now use ``:`` in place of ``row_position`` or ``column_position``

In [None]:
# row_position
df.iloc[1,:]

In [None]:
# column_position
df.iloc[:,0]

Let's select a ``list`` of values this time:

In [None]:
# Multiple rows
df.iloc[[0,1,8], 0]

In [None]:
# Multiple columns
df.iloc[0, [1,4]]

In [None]:
rows_i = [0,1,8]
columns_i = [1,2,4]

In [None]:
df.iloc[rows_i,columns_i]

Slicing start:stop:step:

### With columns

In [None]:
df.iloc[0, 1:4]

### With rows

In [None]:
df.iloc[0:8, 1]

### With step

In [None]:
df.iloc[0:9:2, columns_i]

### With ``step`` & ``:``

In [None]:
df.iloc[0:9:2, :]

7. ***Using Position Selection, select the Revenue, Employees & Country for the companies ``Samsung``, ``Foxconn`` and ``Huawei``.***

In [None]:
df.iloc[[1,3,5],[0,1,4]]

# Mastering DataFrame Mutations with Hollywood data

## Introduction

In this lab, you'll engage with an actual dataset, applying various data manipulation techniques. Our focus will be a movie dataset, including information like the film title, release year, budget, gross earnings, and more.

This lab will cover the following:

- Creating new columns in a data frame by doing basic arithmetic operations (addition, subtraction, division) on existing columns.
- Creating new columns in a data frame by applying boolean operations (less-than ``<``, greater-than ``>``, equals ``==``, etc.) on existing columns.
- Deleting rows based on specific conditions.
- Removing single or multiple columns based on conditions.

Let's start by loading our dataset.

You can do this by using the ``pandas.read_csv()`` function to load the dataset into a pandas dataframe. Afterwards, store the dataframe in a variable named ``df``.

Here is a sample code:

```py
import pandas as pd

df = pd.read_csv("movies.csv")
df
```

> ***IMPORTANT NOTE: Please ensure you complete all activities in the lab in sequence. Each activity builds on the one before it, so skipping an activity will prevent further progress. Complete each task fully before moving on to the next for a successful learning experience.***

In [None]:
import pandas as pd

df = pd.read_csv("movies.csv")

In [None]:
df.columns

In [None]:
df.head()

## Activities

1. **Create a new column ``revenue``**

    Add a new column named ``revenue`` to the dataframe ``df``. This new column should reflect the difference between the values in the ``gross`` and ``budget`` columns.

In [None]:
df['revenue'] = df['gross'] - df['budget']

In [None]:
df.head()

2.  **Create a new column ``percentage_profit``**

    Create a new column called ``percentage_profit``. Calculate its values as the proportion of ``revenue`` out of the gross earnings for each row. For example, if the revenue is 100 million out of gross earnings of 200 million, the ``percentage_profit`` will be 50%.

    - Express the result as a percentage (multiply the proportion by 100).

In [None]:
df['percentage_profit'] = df['revenue']/df['gross']*100

3. **Create a new column ``high_budget_movie``**

    Add a new column named ``high_budget_movie`` to dataframe ``df``. This column should label each movie with ``True`` if it has a budget over 100 million, or ``False`` if it does not.

In [None]:
df['high_budget_movie'] = df['budget'] > 100_000_000

4. **Create a new column ``successful_movie``**

    Add a new column named ``successful_movie``. Assign it the value ``True`` if a movie's ``percentage_profit`` exceeds 50. If it doesn't, assign ``False``.

In [None]:
df['successful_movie'] = df['percentage_profit'] > 50

5. **High-Rated Movies**

    Create a new column called ``high_rated_movie``. If the movie's ``score`` is more than 8, label it as ``True``. If not, label it as ``False``.

In [None]:
df['high_rated_movie'] = df['score'] > 8

6. **Create a new column ``is_new_release``**

    Create a new column named ``is_new_release``. This column should indicate ``True`` if the year column's value is beyond 2020, and ``False`` if it's not.

In [None]:
df['is_new_release'] = df['year'] > 2020

7. **Create a new column ``is_long_movie``**

    Create a new column ``is_long_movie`` which is ``True`` if the value of the ``runtime`` column is greater than 150 minutes and ``False`` otherwise.

In [None]:
df['is_long_movie'] = df['runtime'] > 150

8. **Drop unsuccessful movie**

    Delete all rows in the dataframe ``df`` where the column ``successful_movie`` is labeled as ``False``. Use the ``inplace`` parameter to make sure these modifications are permanent.

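In [None]:
# Option 1: keep only the successful movies by reassigning df
df = df[df["successful_movie"] == True]

In [None]:
# Option 2: drop the unsuccessful rows in place
df.drop(df.loc[df['successful_movie'] == False].index, inplace=True)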
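
The effect of ``inplace`` is easier to see on a small example. The sketch below uses a made-up three-row frame, not the movies dataset:

```python
import pandas as pd

# Hypothetical frame for illustration
toy = pd.DataFrame({'title': ['A', 'B', 'C'],
                    'successful_movie': [True, False, True]})

# Index labels of the rows to remove
to_drop = toy[~toy['successful_movie']].index

# Without inplace, drop returns a new DataFrame and leaves toy untouched
result = toy.drop(to_drop)
rows_before = len(toy)

# With inplace=True, toy itself is modified
toy.drop(to_drop, inplace=True)
rows_after = len(toy)

print(rows_before, len(result), rows_after)
```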

9. **Drop high budget movie**

    Create a new dataframe named ``low_budget_df`` by removing all rows from the original dataframe where the ``budget`` value exceeds 100 million. Remember, changes shouldn't affect the original dataframe.

In [None]:
# Keep movies whose budget does not exceed 100 million
low_budget_df = df[df['budget'] <= 100_000_000]

10. **Removing Low-Voted Movies**

    Remove all the rows from the dataframe where the ``votes`` count is below ``1000``. Assign this updated dataframe to a new variable named ``high_voted_df``. Ensure you do not make these changes to the original dataframe.

In [None]:
# Keep movies whose vote count is not below 1000
high_voted_df = df[df['votes'] >= 1000]

11. **Drop the column ``budget``**

    To delete the ``budget`` column from the movie dataframe, apply the drop method and include the column's name, ``budget``. Make sure to specify the axis to show you're referring to a column, not a row. Also, set the ``inplace`` parameter to ``True`` so the change is applied to ``df`` itself rather than returned as a new dataframe.

In [None]:
df.drop(columns = ['budget'], inplace=True)

12. **Drop the director and writer columns from the dataframe**

    Remove the ``director`` and ``writer`` columns from the dataframe ``df``. To do this, employ the drop method, designating ``director`` and ``writer`` as the column names. Set the axis to confirm that these are columns not rows. Make sure to adjust the ``inplace`` parameter to ``False``, this way you're forming a new dataframe named ``new_df`` without altering the original one.

    Please remember, in this activity, your task is to build a new dataframe named ``new_df``.

In [None]:
new_df = df.drop(['director', 'writer'], axis=1, inplace=False)

In [None]:
new_df = df.drop(columns=['director','writer'])

13. **Drop Out Low-Rated and Low-Voted Movies**

    Drop all the rows where the value of score is less than ``5`` and the value of votes is less than ``1000``. Drop the rows from the original dataframe ``df``.

In [None]:
index_low = df.loc[(df['score'] < 5) & (df['votes'] < 1000) ]
df.drop(index_low.index, inplace=True)

14. **Top High-Rated Movies**

    Create a new DataFrame named ``top_rated_movies``, which should include the top five highly-rated movies. Sort this DataFrame based on the ``score`` column in descending order.

In [None]:
top_rated_movies = df.sort_values(ascending=False, by=['score'])[0:5]

In [None]:
top_rated_movies
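As an alternative to sorting and slicing, pandas also offers ``nlargest``. The sketch below uses a small, hypothetical set of movies (not the lab's dataset) to show that both approaches pick the same scores:

```python
import pandas as pd

# Hypothetical toy data; the lab's movie dataset is not reproduced here.
movies = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "score": [7.1, 8.4, 6.0, 9.0, 8.4, 5.5],
})

# Sort in descending order and keep the first rows (the approach used above).
top_sorted = movies.sort_values(by="score", ascending=False).head(3)

# nlargest expresses the same intent in a single call.
top_nlargest = movies.nlargest(3, "score")
```

Both results contain the scores ``9.0``, ``8.4`` and ``8.4``. When scores are tied, the order of the tied rows may differ between the two approaches.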

15. Removing Specific Rows

    Remove rows with index ``2`` and ``10`` from the DataFrame ``df``.

In [None]:
df.drop([2,10], inplace=True)

16. Sci-Fi Blockbusters

    Create a new DataFrame named ``sci_fi_blockbusters`` containing movies that are 'Sci-Fi' genre and have a gross greater than $150 million.

In [None]:
sci_fi_blockbusters = df[(df['genre'] == 'Sci-Fi') & (df['gross'] > 150_000_000)]

17. **Age of Movies**

    Create a new column named ``age`` to calculate the age of the movie in years. Find it by subtracting the ``year`` column from the current year.

    Consider 2023 as the current year.

In [None]:
df['age'] = 2023 - df['year']

In [None]:
df.head()

18. **Movies Released in Summer**

    Create a new DataFrame containing movies released in June, July, or August. Store the result in dataframe ``summer_movies``.

In [None]:
summer_movies = df[df['released'].str.contains('June|July|August')]

In [None]:
summer_movies

# Modifying DataFrames: Creating columns and more

In this lab we'll cover the most common types of operations to "modify" dataframes. This includes:

- Creating new columns
- Deletion: deleting rows or columns
- Modifications: renaming columns, changing column types, modifying values
- Adding new rows

In the process, we'll also learn the important concepts of mutability/immutability and how to perform a "safe" Data Science workflow (spoiler: by avoiding modifications!)

> ***IMPORTANT NOTE: If you accidentally made incorrect modifications to the dataframe in this lab, you will need to redo all the previous steps in order to successfully complete the activity.***

## Creating new Columns

We'll start with one of the most straightforward operations: creating new columns. We can create new columns in multiple ways, but let's start with the most common one:

### Expressions (and vectorized operations)

The most common way to create a column is as the result of an expression involving other columns of the same DataFrame. If you're familiar with spreadsheets, this is the same idea as a formula column.

The syntax is intuitive: we assign the expression to "the new column":

```py
df["New Column Name"] = [EXPRESSION]
```
In this case, the expression can be anything. Examples:

```py
# A simple arithmetic expression between two columns
df["New Column Name"] = df["Column 1"] + df["Column 2"]

# A boolean expression
df["New Column Name"] = df["Column 1"] > 1000

# A more advanced expression involving multiple columns
df["New Column Name"] = df["Column 1"] * (df["Column 2"] / df["Column 3"]) / df["Column 4"].std()
```

Let's use our sample DataFrame to calculate "Revenue per Employee". The expression is just:

```py
df["Revenue per Employee"] = df["Revenue"] / df["Employees"]
```

We call these expressions "vectorized operations" because they act on whole columns at once, regardless of whether the DataFrame has 100 rows or 1 billion. Vectorized operations are extremely fast, even with large amounts of data.
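To make the idea concrete, here is a minimal sketch on a hypothetical three-row frame (``toy`` is not the lab's DataFrame). It computes the same column once as a vectorized expression and once with an explicit Python loop, purely for comparison:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Revenue": [100, 200, 300], "Employees": [10, 50, 30]})

# Vectorized: one expression applied to whole columns at once.
toy["Revenue per Employee"] = toy["Revenue"] / toy["Employees"]

# Equivalent row-by-row loop, shown only for comparison.
looped = [rev / emp for rev, emp in zip(toy["Revenue"], toy["Employees"])]
```

Both produce the same values. The vectorized form is shorter and avoids running a Python-level loop over every row.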

In [None]:
import pandas as pd

In [None]:
# Lists of data
data = {'Revenue': [274515,200734,182527,181945,143015,129184,92224,85965,84893,
                    82345,77867,73620,69864,63191],
        'Employees': [147000,267937,135301,878429,163000,197000,158000,58604,
                      109700,350864,110600,364800,85858,243540],
        'Sector': ['Consumer Electronics','Consumer Electronics','Software Services',
                   'Chip Manufacturing','Software Services','Consumer Electronics',
                   'Consumer Electronics','Software Services','Consumer Electronics',
                   'Consumer Electronics','Chip Manufacturing','Software Services',
                   'Software Services','Consumer Electronics'],
        'Founding Date':['01-04-1976','13-01-1969','04-09-1998','20-02-1974',
                         '04-04-1975','15-09-1987','01-02-1984','04-02-2004',
                         '07-04-1946','01-01-1910','18-07-1968','16-06-1911',
                         '11-11-1998','07-03-1918'],
        'Country':['USA','South Korea','USA','Taiwan','USA','China','USA','USA',
                   'Japan','Japan','USA','USA','China','Japan']} 
index = ['Apple','Samsung','Alphabet','Foxconn','Microsoft','Huawei',
         'Dell Technologies','Meta','Sony','Hitachi','Intel','IBM',
         'Tencent','Panasonic']

In [None]:
df = pd.DataFrame(data, index=index)

In [None]:
df

In [None]:
df["Revenue per Employee"] = df["Revenue"] / df["Employees"]

In [None]:
df.head()

1. **Creating a new column**

    The column ``Revenue`` is expressed in *millions of dollars*. Create a new one, ``Revenue in $``, with the revenue expressed in US dollars (single units).

In [None]:
df['Revenue in $'] = df['Revenue']*1_000_000

2. **Create a new column: ``Is American?``**

    Create a new boolean column ``Is American?`` that contains the value ``True`` for companies whose ``Country`` is ``USA``, and ``False`` otherwise.

In [None]:
df['Is American?'] = df['Country'] == 'USA'

## Creating Columns out of Fixed Values

### Single (hardcoded) value

We can create columns by also providing values directly. In its simplest form, we just assign the new column to a hardcoded value:

```py
df["New Column"] = VALUE
```

This will set EVERY row in the dataframe to that value. In our notebook, we're setting the column ``Is Tech?`` to ``"Yes"``.

### Collection of values

Instead of providing just one value for the entire dataframe (and for every single row), we can provide a more "granular" collection containing the value for each row we want to assign.

Let's look at the example in the associated notebook. In the variable ``stock_prices`` we're storing the stock prices of the given companies. We'll then assign the values to the column ``"Stock Price"`` directly:

```py
stock_prices = [143.28, 49.87, 88.26, 1.83, 253.75, 0,
                43.4, 167.32, 89.1, 52.6, 25.58, 137.35, 48.23, 8.81]

df['Stock Price'] = stock_prices
```

This works because the list ``stock_prices`` contains exactly as many elements as the DataFrame has rows.

> ***Note: The stock prices here are estimates. Not all companies are listed on the same exchange, so we just estimated the values in dollars. Also, ``Huawei`` is not publicly listed, so we assigned it a value of $0.***
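The length requirement can be seen in a short sketch. The frame ``toy`` and its column names are hypothetical. The sketch also shows that assigning a ``Series`` behaves differently from assigning a list: values are matched by index label rather than by position.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Revenue": [10, 20, 30]}, index=["A", "B", "C"])

# A list must have exactly one value per row.
toy["Price"] = [1.5, 2.5, 3.5]

# A list of the wrong length is rejected with a ValueError.
try:
    toy["Too Short"] = [1.5, 2.5]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# A Series is aligned on the index instead of by position;
# labels missing from the Series (here "B") become NaN.
toy["Partial"] = pd.Series({"C": 3.0, "A": 1.0})
```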

In [None]:
df['Is Tech?'] = "Yes"

In [None]:
df.head()

In [None]:
stock_prices = [143.28, 49.87, 88.26, 1.83, 253.75, 0,
                43.4, 167.32, 89.1, 52.6, 25.58, 137.35, 48.23, 8.81]

In [None]:
df['Stock Price'] = stock_prices

In [None]:
df.head()

## Activities

3. Create a new column with the CEOs of each company

    Create new column ``CEO`` that contains the names of the CEOs of each company. You'll find the list of the CEOs in the associated notebook.

In [None]:
ceo_list = [
    "Tim Cook", "Kim Ki Nam", "Sundar Pichai",
    "Young Liu", "Satya Nadella", "Ren Zhengfei",
    "Michael Dell", "Mark Zuckerberg",
    "Kenichiro Yoshida", "Toshiaki Higashihara", "Patrick Gelsinger",
    "Arvind Krishna", "Ma Huateng", "Yuki Kusumi"]

In [None]:
df['CEO'] = ceo_list

In [None]:
df.head()

### Deleting Columns with `del`

There are two main ways of deleting columns: the ``del`` keyword and the ``.drop()`` method. For now we'll focus only on the ``del`` keyword, as the ``.drop()`` method introduces a few more complexities that we'll need to address later.

The ``del`` keyword is the simplest and most intuitive option, just: ``del df["Column Name"]``. It will modify the underlying dataframe, so use it carefully!

For example, let's delete the column ``Is Tech?`` that we created before.
```py
del df["Is Tech?"]
```
Take a look now at the dataframe and see the column is no longer there.
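Because ``del`` changes the dataframe itself, running the same cell twice will fail the second time. A small sketch on a hypothetical frame ``toy`` illustrates this:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

del toy["b"]                      # toy is modified in place
columns_after = list(toy.columns)

# Deleting a column that no longer exists raises a KeyError.
try:
    del toy["b"]
    second_delete_failed = False
except KeyError:
    second_delete_failed = True
```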

4. Delete the column CEO

    Using the ``del`` keyword, delete the column ``CEO``.

In [None]:
del df['CEO']

In [None]:
df.head()

## Mutability and Immutability

This is a **VERY important** concept in Data Science (and programming in general). When solving problems, we usually have the option to resolve them with a "mutable" solution, that is, modifying (or *mutating*) the underlying dataframe, or with an *immutable* solution, which performs the changes but without modifying the underlying data.

For example, most of the string methods in Python are **immutable**. You can perform a wide variety of operations (``replace``, ``title``, ``upper``, ``lower``, etc.) but the original string is NOT modified; these operations return NEW strings (new copies) with the desired changes applied. Take a look at the notebook for a few of these string examples and pay attention to how the string ``s`` is not modified after any of the operations.
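A minimal sketch of that behavior follows. The value of ``s`` here is an assumption and may differ from the one used in the notebook:

```python
s = "hello world"

upper = s.upper()                        # returns a new string
replaced = s.replace("world", "pandas")  # also a new string
titled = s.title()                       # and again a new string

# s itself still holds "hello world" after all three calls.
```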

### Favor Immutability

Python's design choice for strings (and for other modules not mentioned here) is not a coincidence. Most of the time, with only rare exceptions, we **should prefer immutable solutions**. This applies especially in pandas: prefer operations that don't modify the underlying DataFrames or Series. That way, you can always safely try things without the risk of losing important data.

Here's an example of the flow you should expect when performing immutable operations (don't worry about the methods below, we'll learn about them in this and other projects):

```py
df = pd.read_csv(...)  # load the data
df_renamed = df.rename(...) # rename columns
df_notna = df_renamed.dropna(...)  # dropping null values
df_cleaned = df_notna.drop(...)  # dropping some values
```

As you can see, the result of each operation is the "entry point" of the following operation, creating a chain. This is intentional, because, as you'll see, we'll use this "chaining" to our advantage. It's pretty common to see expressions that are a combination (chaining) of multiple methods one after the other:

```py
df.dropna().drop([...]).rename(columns={...}).sort_values(by=...).head()
```
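The chain above is schematic. Here is a runnable sketch of the same idea, assuming a hypothetical four-row frame ``raw`` with made-up column names:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
raw = pd.DataFrame(
    {"rev": [30.0, None, 10.0, 20.0], "country": ["USA", "Japan", "USA", "China"]},
    index=["A", "B", "C", "D"],
)

cleaned = (
    raw.dropna()                                    # removes row B (missing value)
       .drop(["D"])                                 # removes row D by label
       .rename(columns={"rev": "Revenue"})          # renames a column
       .sort_values(by="Revenue", ascending=False)  # largest revenue first
       .head()
)
```

Every step returns a new DataFrame, so ``raw`` keeps its four rows and its original column names. Wrapping the chain in parentheses is only a formatting choice that allows one method per line.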

### The ``inplace`` parameter

Before moving forward, we need to make a special mention about the inplace parameter.

The ``inplace`` parameter is EVERYWHERE in pandas methods, both for DataFrames and Series. For example, ``df.dropna(inplace=True)``, ``df.drop([...], inplace=True)``, ``df.drop_duplicates(inplace=True)``, etc.

The inplace parameter changes the behavior of a given method from immutable (default) to mutable, modifying the underlying DataFrame. Again, by default, ``inplace`` is always ``False``, as Pandas is always favoring immutability. You can alter that behavior by setting ``inplace=True``, although, as we just mentioned, it's NOT recommended, except in some special cases.
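The difference between the two modes can be sketched with ``dropna`` on a hypothetical frame ``toy``:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"x": [1.0, None, 3.0]})

# Default (inplace=False): a new DataFrame is returned and toy is untouched.
result = toy.dropna()
rows_after_default = len(toy)

# inplace=True: toy itself is modified.
toy.dropna(inplace=True)
rows_after_inplace = len(toy)
```

After the first call ``toy`` still has three rows. After the second call it has two, and the row with the missing value cannot be recovered from ``toy`` anymore.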

Now, let's move to the next section to put these concepts to use!

## Deleting rows

The method to delete "arbitrary" rows is: ``.drop``. It has some variations, as it can also be used to delete columns, but let's start with the basics.

The ``.drop()`` method accepts the indices of the values we want to remove, and as we previously mentioned, by default, is **immutable**.

In the notebook you can see an example of deleting multiple rows:

```py
df.drop(["Microsoft", "Tencent", "Samsung", "Alphabet", "Meta", "Hitachi", "Apple"])
```

Again, this method **is IMMUTABLE**. It doesn't modify the underlying dataframe: it immediately returns a new DataFrame with the modifications done. The common pattern is to assign the results of ``.drop`` to a variable: ``df_new = df.drop(...)``. This allows us to re-play any operation if we find a mistake in the process.
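A short sketch of that pattern, using a hypothetical frame ``toy`` instead of the lab's data:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Revenue": [10, 20, 30]}, index=["A", "B", "C"])

# A new DataFrame without rows A and C.
df_new = toy.drop(["A", "C"])

# toy still holds all three rows, so a different drop can be tried from scratch.
df_other = toy.drop(["B"])
```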

5. Drop Microsoft from the df

    Using ``.drop``, delete ``Microsoft`` and assign the result to ``df_no_windows``. IMPORTANT, you should NOT modify ``df``.

In [None]:
df_no_windows = df.drop('Microsoft')

In [None]:
df_no_windows.head()

### Mutable modification with ``inplace``

We have iterated and reiterated the importance of favoring immutable solutions. Still, we need to cover how to perform mutable operations. Just remember to use mutability as sparingly as possible.

As we've mentioned, most pandas operations accept the ``inplace`` parameter, which, when passed a ``True`` value, will perform the operation by modifying the underlying DataFrame. We'll do it now with the ``.drop()`` method. Switch to the notebook to see how we're deleting the row for Huawei:

```py
df.drop("Huawei", inplace=True)
```

As you can see, the method doesn't return anything (just ``None``). It doesn't provide a result because the operation has already been applied to the underlying DataFrame. Check ``df`` again to see that we have correctly deleted ``Huawei`` from the DataFrame.
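One consequence is worth spelling out. Since the call returns ``None``, assigning its result back to the same variable would discard the DataFrame. The sketch below uses a hypothetical frame ``toy`` and stores the return value in a separate name to make this visible:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Revenue": [10, 20, 30]}, index=["A", "B", "C"])

# The drop is applied to toy itself; the return value is None.
result = toy.drop("B", inplace=True)

# Writing `toy = toy.drop("B", inplace=True)` instead would replace toy with None.
```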

6. Delete the rows for ``IBM`` and ``Dell Technologies`` in place
    
    Perform a mutable operation and delete the rows containing information for ``IBM`` and ``Dell Technologies``.

In [None]:
df.drop(['IBM','Dell Technologies'], inplace = True)

## Deleting rows based on a condition

Deleting rows based on a condition is simple, as we'll just use the same ``.drop()`` method as before. What might be a little bit more "complicated" is the final syntax that will result from writing these conditions.

First of all, remember that the ``.drop()`` method receives the index labels of the rows we want to delete. Let's say we want to delete the companies with Revenue of LESS than ``M$80,000``. Those companies are ``Intel``, ``Tencent`` and ``Panasonic``. So, we have to arrive at an expression that is equivalent to:
```py
df.drop(["Intel", "Tencent", "Panasonic"])
```

How do we do that with conditions? It involves two steps:

First, let's write the condition:
```py
df.loc[df["Revenue"] < 80_000]
```
But then we need the index of the rows that match that condition, so we append ``.index`` to the end of the selection. The final expression ends up being:
```py
df.drop(df.loc[df["Revenue"] < 80_000].index)
```
That, as we said before, is not the most pleasing syntactical experience.
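For comparison, the same rows can also be removed by keeping everything that does NOT match the condition. The sketch below uses a hypothetical frame ``toy`` and shows both forms side by side:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Revenue": [90_000, 75_000, 82_000, 60_000]},
                   index=["A", "B", "C", "D"])

# Drop the rows that match the condition (the pattern described above).
dropped = toy.drop(toy.loc[toy["Revenue"] < 80_000].index)

# Keep the rows that do NOT match the condition; ~ negates the boolean mask.
kept = toy[~(toy["Revenue"] < 80_000)]
```

The two results agree here because the index labels are unique. With repeated labels, ``.drop()`` removes every row that shares a dropped label, so the results could differ.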


7. Delete companies with revenue lower than the mean

    Drop the companies that have a value of ``Revenue`` lower than the mean (average ``Revenue``). Do NOT modify the original DataFrame; store the new results in ``df_high_revenue``.

In [None]:
df['Revenue'].mean()

In [None]:
df_high_revenue = df.drop(df.loc[df['Revenue'] < df['Revenue'].mean()].index)

8. Drop the companies that are NOT from the ``USA``

    Drop the companies whose country is NOT ``USA``. Store the results in the variable ``df_usa_only``.

In [None]:
df_usa_only = df.drop(df.loc[~df['Is American?']].index)

In [None]:
df_usa_only

9. Japanese companies sorted by Revenue (desc)

    Using chaining methods, perform the following two operations in the same expression:

    - drop all the companies that are NOT Japanese
    - sort them by Revenue in descending order

    Store your results in the variable ``df_jp_desc``

In [None]:
df_jp_desc = df.drop(df.loc[df['Country'] != 'Japan'].index).sort_values(by ='Revenue' ,ascending = False)

In [None]:
df_jp_desc

### Removing columns with ``.drop()``

Finally, it's worth mentioning that the ``.drop()`` method can be used to delete columns as well, as an immutable alternative to ``del``. The syntax is the same as removing rows, but to indicate that we want to delete columns, we must pass ``axis=1`` as a parameter. By default, the axis parameter is ``0``, which means "delete at row level"; by setting it to ``1`` we're indicating we're deleting columns.

In [None]:
df.drop(['Revenue', 'Employees'], axis=1)
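The ``.drop()`` method also accepts a ``columns`` keyword, which some readers find easier to read than ``axis=1``. A small sketch on a hypothetical frame ``toy`` shows that the two spellings give the same result:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"Revenue": [1, 2], "Employees": [3, 4], "Country": ["USA", "Japan"]})

by_axis = toy.drop(["Revenue", "Employees"], axis=1)        # axis=1 means columns
by_keyword = toy.drop(columns=["Revenue", "Employees"])     # same result, named explicitly
```

In both cases ``toy`` keeps its three columns, since neither call uses ``inplace=True``.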

# Exploring DataFrames with Pokemon Data

In this project, we'll explore the basics of DataFrames using a Dataset containing Pokemon information.

We'll start by reading the dataset using the ``pd.read_csv()`` method.

Then, the ``df.head()`` and ``df.info()`` methods give us a bit more information about the underlying data.

Play around with it and once you're ready, jump to the next page to complete the activities.

## Basic Activities

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('pokemon.csv')

In [3]:
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
4,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


1. How many rows does our DataFrame have?

    How many total rows do we have in ``df``?

In [8]:
df.shape[0]

721

2. What's the type of index of our DataFrame?

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 721 entries, 0 to 720
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           721 non-null    int64 
 1   Name        721 non-null    object
 2   Type 1      721 non-null    object
 3   Type 2      359 non-null    object
 4   Total       721 non-null    int64 
 5   HP          721 non-null    int64 
 6   Attack      721 non-null    int64 
 7   Defense     721 non-null    int64 
 8   Sp. Atk     721 non-null    int64 
 9   Sp. Def     721 non-null    int64 
 10  Speed       721 non-null    int64 
 11  Generation  721 non-null    int64 
 12  Legendary   721 non-null    bool  
dtypes: bool(1), int64(9), object(3)
memory usage: 68.4+ KB


In [9]:
df.index.dtype

dtype('int64')

3. How many columns does our DataFrame have?

In [11]:
df.shape[1]

13

4. What's the shape of our DataFrame?

In [12]:
df.shape

(721, 13)

5. What's the type of the column Name?

In [14]:
df['Name'].dtype

dtype('O')

6. Which of the following columns are of a numeric type?

    Mark the ones that are of some numeric type (``int``, ``float``, etc).
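One way to check this programmatically is ``select_dtypes``. The sketch below runs it on a hypothetical two-row frame with a few Pokemon-like columns, not on the full dataset:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander"],
    "HP": [45, 39],
    "Speed": [45.0, 65.0],
    "Legendary": [False, False],
})

# Keep only the columns with a numeric dtype (int, float, ...).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
```

Here ``numeric_cols`` is ``["HP", "Speed"]``: the text column and the boolean column are not selected.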

## Advanced Activities

7. Select the column ``Defense``

    Select the column defense and store it in the variable ``defense_col``.

In [15]:
defense_col = df['Defense']

8. What's the type of the variable ``defense_col``?

    What's the type of the variable in which you've selected the column Defense?

In [16]:
type(defense_col)

pandas.core.series.Series

9. What's the maximum value of ``Attack``?

    Insert the whole number, without decimals

In [17]:
df['Attack'].max()

165

10. What's the average value for ``Speed``?

    Insert the answer up to 2 decimals. Example, if the value is 84.813799 insert just 84.81.

In [18]:
df['Speed'].mean()

65.71428571428571

# Mastering DataFrame Mutations with Wine Quality Data

## Introduction and Objectives

Welcome to the hands-on practice session of the lesson on modifying DataFrames! As you dive deeper into this section, you will have the opportunity to hone your skills in manipulating datasets. In this tutorial, we will be working with the ``wine quality dataset``, which provides insightful information about the Portuguese ``Vinho Verde`` red wine collected in 2009. This dataset contains ``12`` attributes and ``1599`` data points that reflect the physicochemical properties and quality of the wine, giving us a better understanding of consumer preferences.

Using the pandas library, we will load the dataset into ``wine_quality_df`` with ``pd.read_csv('winequality-red.csv')`` and inspect it with ``wine_quality_df.info()`` and ``wine_quality_df.describe()``.

> ***The dataset used for this project was taken from the publicly available and Open Source [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/wine+quality).***

Before we dive into the manipulation of the dataset, it's important to clean the data to ensure accuracy. We will be using various techniques such as ``df.insert``, ``df.astype``, ``df.columns``, ``df.rename``, and ``df.drop`` to modify the data frame to our desired outcome. By the end of this project, you will be able to confidently manipulate columns and rows and perform operations on the dataset with ease.

It's time to put your newly acquired skills to the test! The next page contains questions that will require you to carefully consider your answer.

## Basic Analysis

Great, let's kick off our analysis by performing some basic activities on our data. This will give us a better understanding of the dataset and allow us to uncover hidden insights. Remember, the first step to gaining knowledge is to understand the data we are working with, so let's get started!

To ensure the integrity of our original data set, it's a best practice to work with a copy of the data frame when performing data manipulation. By creating a copy, we can freely experiment with various techniques and make modifications without affecting the original data.

```py
df = wine_quality_df.copy()
```
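The reason ``.copy()`` matters is that a plain assignment does not create a new DataFrame; it only gives the same object a second name. A minimal sketch with a hypothetical frame ``original``:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
original = pd.DataFrame({"quality": [5, 6, 7]})

alias = original            # same object under a second name, not a copy
working = original.copy()   # independent copy

working["quality"] = working["quality"] + 1   # only the copy changes
alias["flag"] = True                          # this also changes `original`
```

After running this, ``original`` still has the quality values 5, 6 and 7, but it has gained the ``flag`` column through ``alias``.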

In [19]:
import pandas as pd
import numpy as np

In [38]:
wine_quality_df = pd.read_csv('winequality-red.csv', sep= ',')

In [39]:
wine_quality_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [40]:
wine_quality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [41]:
wine_quality_df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [42]:
df = wine_quality_df.copy()

3. What is the median wine quality?

    Enter the answer to 1 decimal point.

In [43]:
df['quality'].median()

6.0

### Row and Column modification

This section contains a jupyter lab activity based on row and column modification. Please launch the notebook on the right side of the screen.

4. Rename dataframe columns to appropriate format

    Rename the columns to have underscore instead of space. For example old name: ``fixed acidity`` to the new name: ``fixed_acidity``. Skip single-word columns. Set ``inplace=True``.

In [44]:
df.rename(columns={'fixed acidity':'fixed_acidity', 
                   'volatile acidity':'volatile_acidity',
                   'citric acid':'citric_acid',
                   'residual sugar':'residual_sugar',
                   'free sulfur dioxide':'free_sulfur_dioxide',
                   'total sulfur dioxide':'total_sulfur_dioxide'}, inplace = True)
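Typing the mapping by hand works well for a handful of columns. As a sketch of an alternative, the mapping can be built from the column names themselves. The frame ``toy`` below is hypothetical and contains only three of the column names:

```python
import pandas as pd

# Hypothetical empty frame carrying a few of the wine column names.
toy = pd.DataFrame(columns=["fixed acidity", "total sulfur dioxide", "pH"])

# Build {old_name: new_name} for every column whose name contains a space.
mapping = {col: col.replace(" ", "_") for col in toy.columns if " " in col}

# Immutable variant: returns a renamed copy and leaves toy as it was.
renamed = toy.rename(columns=mapping)
```

Single-word columns such as ``pH`` are not in the mapping, so they keep their names.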

In [45]:
df

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


5. Drop the first and last row

    Perform the modification and store it in a new variable: ``df_first_last``.

In [46]:
df_first_last = df.drop([df.index[0], df.index[-1]])

In [48]:
df_first_last = df.drop(df.iloc[[0,-1]].index)
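A third option is to skip ``.drop()`` and select everything between the first and last row by position. The sketch below compares it with the first solution on a hypothetical four-row frame ``toy``:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"v": [10, 20, 30, 40]})

# Drop the first and last index labels (as in the first solution above).
by_drop = toy.drop([toy.index[0], toy.index[-1]])

# Positional slice: from the second row up to, but not including, the last row.
by_slice = toy.iloc[1:-1]
```

Both keep the rows labeled 1 and 2 and leave ``toy`` unchanged.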

In [49]:
df_first_last

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
5,7.4,0.660,0.00,1.8,0.075,13.0,40.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1593,6.8,0.620,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6


6. Remove maximum total sulfur dioxide from dataset

    Locate and remove the row with the maximum value for ``total_sulfur_dioxide`` and store it in a new variable: ``df_drop``.
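One possible way to approach this, sketched on a hypothetical three-row frame rather than the wine data: ``idxmax()`` returns the index label of the largest value, and that label can be passed to ``.drop()``.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"total_sulfur_dioxide": [34.0, 289.0, 60.0]})

# Index label of the row holding the maximum value.
max_label = toy["total_sulfur_dioxide"].idxmax()

# Immutable drop: toy keeps all of its rows, toy_drop lacks the maximum.
toy_drop = toy.drop(max_label)
```

If the maximum appears more than once, ``idxmax()`` returns only the first matching label, so only that row would be removed.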

In [None]:
df