## Rajesh's DS & AI Learning

# 1. Introduction to the Data

* pandas supports string `labels` for axis values and `multiple data types` in a single dataframe.
* Beyond that, it has many built-in methods and functions for common exploration and analysis tasks.

We'll continue working with a data set from [Fortune](https://fortune.com/) magazine's Global [500 list 2017](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we modified the original data set to make it more accessible.

In [1]:

import os

# Define the name of your CSV file
csv_filename = "f500.csv.csv"

# Get the current working directory
current_directory = os.getcwd()

# Move back to the grandparent directory (two levels up)
project_directory = os.path.dirname(os.path.dirname(current_directory))

# Navigate to the "DataSets" folder
datasets_directory = os.path.join(project_directory, "DataSets")

# Construct the full path to your CSV file
csv_path = os.path.join(datasets_directory, csv_filename)

# Check if the file exists
if os.path.exists(csv_path):
    print("CSV file found at:", csv_path)
else:
    print("CSV file not found at:", csv_path)

# import pandas module
import pandas as pd


#import dataset file 
f500=pd.read_csv(csv_path)

## TODO
* Use the `DataFrame.head()` and `DataFrame.info()` methods to refamiliarize ourselves with the data.

In [None]:
f500.head(7)

In [None]:
f500.info()

# 2. Vectorized Operations

**Vectorization not only improves our code's performance, but also enables us to write code more quickly.**

* **Because pandas is built on top of NumPy, it also supports vectorized operations.**

Just like with NumPy, we can use any of the standard Python numeric operators with series, including:

* series_a + series_b - Addition
* series_a - series_b - Subtraction
* series_a * series_b - Multiplication (element-wise; this is unrelated to the matrix multiplication used in linear algebra).
* series_a / series_b - Division
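
As a minimal sketch of the operators listed above, the example below uses two small made-up series (the names and values are illustrative and are not taken from the f500 data). Each operator works element by element, matching values by index label.

```python
import pandas as pd

# Hypothetical toy data, not taken from f500
series_a = pd.Series([10, 20, 30], index=["x", "y", "z"])
series_b = pd.Series([1, 2, 3], index=["x", "y", "z"])

total = series_a + series_b       # element-wise addition
difference = series_a - series_b  # element-wise subtraction
product = series_a * series_b     # element-wise multiplication
ratio = series_a / series_b       # element-wise division (float result)

print(product)
```

No loop is needed: each line above produces a new series with the same index labels as the inputs.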

## TODO:
* Subtract the values in the rank column from the values in the previous_rank column. Assign the result to rank_change.

In [4]:
rank_change=f500['previous_rank']-f500['rank']

In [None]:
rank_change[:5]

# 3. Series Data Exploration Methods

Like NumPy, pandas supports many descriptive statistics methods that can help us explore the values in a series. Here are a few of the most useful ones:

* Series.max()
* Series.min()
* Series.mean()
* Series.median()
* Series.mode()
* Series.sum()
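
To see what these methods return, here is a short sketch on a made-up series (the values are illustrative only). Note that `Series.mode()` returns a series rather than a single value, because there can be more than one mode.

```python
import pandas as pd

# Hypothetical values, chosen so the results are easy to check by hand
scores = pd.Series([3, 7, 7, 10, 13])

score_max = scores.max()        # largest value
score_min = scores.min()        # smallest value
score_mean = scores.mean()      # arithmetic mean
score_median = scores.median()  # middle value
score_sum = scores.sum()        # total of all values
score_modes = scores.mode()     # most frequent value(s), returned as a Series

print(score_max, score_min, score_mean, score_median, score_sum)
print(score_modes.tolist())
```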

## TODO:
* Use the Series.max() method to find the maximum value for the rank_change series. Assign the result to the variable rank_change_max.
* Use the Series.min() method to find the minimum value for the rank_change series. Assign the result to the variable rank_change_min.

In [6]:
rank_change_max=rank_change.max()
rank_change_min=rank_change.min()

In [None]:
rank_change_max

In [None]:
rank_change_min

### Observation:
* The maximum value of rank_change is 226 (the largest climb in rank), and the minimum is -500 (a drop of 500 places).
* However, according to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the rank column or previous_rank column.

# 4. Series Describe Method

* The `Series.describe()` method can help us investigate this issue more quickly.
* This method tells us how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics.


* **If we use describe() on a column that contains non-numeric values, we get some different statistics.**
  * count  : number of non-null values
  * unique : number of unique values
  * top    : most common value
  * freq   : frequency of the most common value
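
The difference between the two kinds of output can be sketched with two small made-up series, one numeric and one containing strings (both are illustrative and unrelated to f500):

```python
import pandas as pd

# Hypothetical data for illustration
numeric = pd.Series([1, 2, 3, 4])
text = pd.Series(["a", "b", "a", None])  # one null value

numeric_desc = numeric.describe()  # count, mean, std, min, quartiles, max
text_desc = text.describe()        # count, unique, top, freq

print(numeric_desc)
print(text_desc)
```

In the string series, the null value is excluded from `count`, so it reports 3 rather than 4.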

## TODO:
* Return a series of descriptive statistics for the rank column in f500.
  * Select the rank column. Assign it to a variable named rank.
  * Use the Series.describe() method to return a series of statistics for rank. Assign the result to rank_desc.
* Return a series of descriptive statistics for the previous_rank column in f500.
  * Select the previous_rank column. Assign it to a variable named prev_rank.
  * Use the Series.describe() method to return a series of statistics for prev_rank. Assign the result to prev_rank_desc.

In [9]:
rank=f500['rank']
rank_desc=rank.describe()

In [10]:
prev_rank=f500['previous_rank']
prev_rank_desc=prev_rank.describe()

In [None]:
rank_desc

In [None]:
prev_rank_desc

### Observation:
The statistics for the previous_rank column show a minimum value of 0. However, this column should only have values between 1 and 500 (inclusive), so a value of 0 doesn't make sense. To investigate the possible cause of this issue, let's confirm the number of 0 values that appear in the previous_rank column.

# 5. Method Chaining

**Method chaining is a way to combine multiple methods together in a single line.**
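
A small sketch of the idea, using a made-up series of ranks (illustrative values only): the same result is computed first with an intermediate variable and then as a single chained expression.

```python
import pandas as pd

# Hypothetical ranks, where 0 stands in for a suspicious value
ranks = pd.Series([0, 5, 0, 3, 0])

# Without chaining: store the intermediate result, then select from it
counts = ranks.value_counts()
zeros_stepwise = counts.loc[0]

# With chaining: call .loc directly on the result of value_counts()
zeros_chained = ranks.value_counts().loc[0]

print(zeros_stepwise, zeros_chained)  # 3 3
```

Chaining works here because `value_counts()` returns a series whose index holds the original values, so `.loc[0]` looks up the count for the value 0.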

## TODO
* Use Series.value_counts() and Series.loc to return the number of companies with a value of 0 in the previous_rank column in the f500 dataframe.
* Assign the results to zero_previous_rank.

In [None]:
zero_previous_rank=f500['previous_rank'].value_counts().loc[0]
zero_previous_rank

# 6. Dataframe Exploration Methods

Because series and dataframes are two distinct objects, they have their own unique methods. However, there are many times where both series and dataframe objects have a method of the same name that behaves in similar ways. Below are some examples:

* Series.max() and DataFrame.max()
* Series.min() and DataFrame.min()
* Series.mean() and DataFrame.mean()
* Series.median() and DataFrame.median()
* Series.mode() and DataFrame.mode()
* Series.sum() and DataFrame.sum()

**Unlike with their series counterparts, the axis parameter matters for dataframe methods: it tells pandas which axis to calculate across, and it defaults to `axis=0` (one result per column).**

* While you can use integers to refer to the first and second axis, **pandas dataframe methods also accept the strings `"index"` and `"columns"` for the axis parameter.**
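
The following sketch shows the integer and string forms of the axis argument side by side on a small made-up dataframe (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical two-column dataframe
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 / "index": calculate down the rows, giving one result per column
col_max_int = df.max(axis=0)
col_max_str = df.max(axis="index")

# axis=1 / "columns": calculate across the columns, giving one result per row
row_sum_int = df.sum(axis=1)
row_sum_str = df.sum(axis="columns")

print(col_max_str)
print(row_sum_str)
```

The string form can be easier to read, since it names the axis that the calculation moves along.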

## TODO:
* Use the DataFrame.max() method to find the maximum value for only the numeric columns from f500 (you may need to check the documentation). Assign the result to the variable max_f500.

In [14]:
max_f500=f500.max(numeric_only=True)

In [None]:
max_f500

# 7. Dataframe Describe Method

* One difference between the series and dataframe describe() methods is that we need to specify manually if we want to see the statistics for the non-numeric columns.
* By default, DataFrame.describe() will return statistics for only numeric columns.
* If we want to get just the object columns, we need to use the `include=['O']` parameter.

## TODO
* Return a dataframe of descriptive statistics for all of the numeric columns in f500. Assign the result to f500_desc.

In [16]:
# includes numeric columns

f500_desc=f500.describe()

In [None]:
f500_desc

In [None]:
# for object type columns only

f500_obj=f500.describe(include=['O'])
f500_obj

# 8. Assignment with pandas

Previously, we concluded that companies with a previous rank of zero didn't have a previous rank at all. Next, we'll replace these values with a null value to clearly indicate that the value is missing.

We'll learn how to do two things so we can correct these values:

* **Perform assignment in pandas**
* **Use boolean indexing in pandas.**

* Just like in NumPy, the same techniques that we use to select data could be used for assignment. When we selected a whole column by label and used assignment, we assigned the value to every item in that column.

* By providing labels for both axes, we can assign a new value to a single cell within our dataframe.
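
Both kinds of assignment can be sketched on a tiny made-up dataframe. The company names, column names, and values below are hypothetical, and the sketch assumes the company names are used as row labels.

```python
import pandas as pd

# Hypothetical dataframe with company names as the row labels
df = pd.DataFrame(
    {"ceo": ["Ann", "Bob"], "revenues": [100, 200]},
    index=["Company A", "Company B"],
)

# Row label + column label: assigns to a single cell
df.loc["Company A", "ceo"] = "Cara"

# Column label only: assigns the same value to every row in that column
df["revenues"] = 0

print(df)
```

If the row label does not already exist in the index, `.loc` assignment adds a new row instead of updating one, so it is worth checking the index first.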

## TODO:
* The company "Dow Chemical" has named a new CEO. Update the value where the row label is Dow Chemical and for the ceo column to Jim Fitterling in the f500 dataframe.

In [19]:
# Assumes company names are the row labels of f500
f500.loc['Dow Chemical','ceo']='Jim Fitterling'

# 9. Using Boolean Indexing with pandas Objects

* While it's helpful to be able to replace specific values when we know the row label ahead of time, this can be cumbersome when we need to replace many values. Instead, we can `use boolean indexing to change all rows that meet the same criteria`, just like we did with NumPy.
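
A minimal sketch of the technique, using a made-up dataframe (the rows and values are illustrative, not from f500): a comparison produces a boolean series, which can then be used both to select rows and to assign to them.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "industry": ["Banks", "Motor", "Banks"],
    "country": ["USA", "Japan", "China"],
})

# The comparison returns a boolean series: True where the condition holds
is_bank = df["industry"] == "Banks"

# Select the country values for the matching rows only
bank_countries = df.loc[is_bank, "country"]

# Assign a new value to every matching row in one statement
df.loc[is_bank, "industry"] = "Banking"

print(bank_countries.tolist())  # ['USA', 'China']
print(df["industry"].tolist())  # ['Banking', 'Motor', 'Banking']
```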

## TODO:
* Create a boolean series, motor_bool, that compares whether the values in the industry column from the f500 dataframe are equal to "Motor Vehicles and Parts".
* Use the motor_bool boolean series to index the country column. Assign the result to motor_countries.


In [20]:
motor_bool=f500['industry'] =='Motor Vehicles and Parts'

In [21]:
motor_countries=f500.loc[motor_bool,'country'].value_counts()

In [None]:
motor_countries

# 10. Using Boolean Arrays to Assign Values

**The `dropna=False` parameter stops the `Series.value_counts()` method from excluding null values when it makes its calculation.**
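
The effect of `dropna=False` can be sketched with a small made-up series that contains null values (illustrative data only):

```python
import numpy as np
import pandas as pd

# Hypothetical series with three null values
values = pd.Series([1.0, np.nan, 1.0, np.nan, np.nan, 2.0])

without_nulls = values.value_counts()          # nulls are left out by default
with_nulls = values.value_counts(dropna=False) # nulls get their own entry

print(without_nulls)
print(with_nulls)
```

With the default setting the counts add up to 3, while with `dropna=False` they add up to 6, the full length of the series.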

## TODO
* Use boolean indexing to update values in the previous_rank column of the f500 dataframe:
  * There should now be a value of np.nan where there previously was a value of 0.
  * It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.
* Create a new pandas series, prev_rank_after, using the same syntax that was used to create the prev_rank_before series.

In [None]:
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
prev_rank_before

In [24]:
import numpy as np
f500.loc[f500['previous_rank']==0,'previous_rank']=np.nan

In [None]:
prev_rank_after=f500['previous_rank'].value_counts(dropna=False).head()
print(prev_rank_after)

# 11. Creating New Columns

## TODO
* Add a new column named rank_change to the f500 dataframe by subtracting the values in the rank column from the values in the previous_rank column.
* Use the Series.describe() method to return a series of descriptive statistics for the rank_change column. Assign the result to rank_change_desc.

In [None]:
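f500['rank_change']=f500['previous_rank']-f500['rank']
# Describe the new column, not the rank_change series created before the 0 values were replaced
rank_change_desc=f500['rank_change'].describe()
f500['rank_change'].head(8)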

# 12. Challenge: Top Performers by Country

In [None]:
top_3_performs=f500['country'].value_counts().head(3)
top_3_performs

## TODO
* Create a series, industry_usa, containing counts of the two most common values in the industry column for companies headquartered in the USA.
* Create a series, sector_china, containing counts of the three most common values in the sector column for companies headquartered in China.

In [28]:
industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
sector_china = f500["sector"][f500["country"] == "China"].value_counts().head(3)
mean_employees_japan = f500["employees"][f500["country"] == "Japan"].mean()

In [None]:
industry_usa

In [None]:
sector_china