## Rajesh's DS & AI Learning

# 1. Understanding pandas and NumPy

### Although NumPy provides fundamental structures and tools that make working with data easier, there are several things that limit its usefulness:

* The lack of support for column names forces us to frame questions as multi-dimensional array operations.
* Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
* There are lots of low level methods, but there are many common analysis patterns that don't have pre-built methods.
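
The single-dtype limitation can be seen in a short sketch. The values below are made up for illustration and are not taken from the Fortune data set.

```python
import numpy as np

# Hypothetical values: mixing numbers and a string in one ndarray
mixed = np.array([1, 2.5, "three"])

# NumPy picks a single dtype for the whole array, so the numbers become strings
print(mixed.dtype.kind)      # "U" stands for a Unicode string dtype
print(mixed[0] + mixed[1])   # string concatenation, not numeric addition
```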

* Pandas is not so much a replacement for NumPy as an extension of it. The underlying code for pandas uses the NumPy library extensively.

**The primary data structure in pandas is called a dataframe. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:**

* Axis values can have string labels, not just numeric ones.
* Dataframes can contain columns with multiple data types: including integer, float, and string.
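
As a rough sketch of both differences, the small dataframe below uses string labels on both axes and a different data type in each column. The company names and figures are hypothetical placeholders, not rows from the real data set.

```python
import pandas as pd

# Hypothetical companies and figures, used only to illustrate the structure
df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [100.5, 90.0, 80.25],
        "country": ["USA", "China", "Japan"],
    },
    index=["Company A", "Company B", "Company C"],
)

# Both axes carry string labels
print(df.index.tolist())
print(df.columns.tolist())

# Each column keeps its own dtype
print(df.dtypes)
```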

# 2. Introduction to the Data

We'll work with a data set from [Fortune](https://fortune.com/) magazine's [2017 Global 500 list](https://en.wikipedia.org/wiki/Fortune_Global_500), which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017).

In [1]:
import os

# Define the name of your CSV file
csv_filename = "f500.csv.csv"

# Get the current working directory
current_directory = os.getcwd()

# Move up to the grandparent directory (two levels up)
project_directory = os.path.dirname(os.path.dirname(current_directory))

# Navigate to the "DataSets" folder
datasets_directory = os.path.join(project_directory, "DataSets")

# Construct the full path to your CSV file
csv_path = os.path.join(datasets_directory, csv_filename)

# Check if the file exists
if os.path.exists(csv_path):
    print("CSV file found at:", csv_path)
else:
    print("CSV file not found at:", csv_path)

# Import the pandas module
import pandas as pd

# Read the f500 data set, using the first column as the index
f500 = pd.read_csv(csv_path, index_col=0)
f500.index.name = None
f500_type = type(f500)
f500_shape = f500.shape

# 3. Introducing DataFrames

* To view the first few rows of our dataframe, we can use the `DataFrame.head(no_of_rows)` method.
* To view the last few rows of our dataframe, we can use the `DataFrame.tail(no_of_rows)` method.
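
A minimal sketch of both methods on a hypothetical ten-row dataframe. When no argument is passed, `head()` and `tail()` return 5 rows.

```python
import pandas as pd

# Hypothetical ten-row dataframe with values 0 through 9
df = pd.DataFrame({"value": range(10)})

first_default = df.head()   # no argument: the first 5 rows
last_three = df.tail(3)     # the last 3 rows

print(first_default.shape)
print(last_three["value"].tolist())
```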

In [2]:
# View the first 6 rows and the last 7 rows
f500_head = f500.head(6)
f500_tail = f500.tail(7)

In [None]:
f500_head

In [None]:
f500_tail

# 4. Introducing DataFrames Continued

* Another feature that makes pandas better for working with data is that dataframes can contain more than one data type.

* We can use the `DataFrame.dtypes` attribute (similar to NumPy's `ndarray.dtype` attribute) to return information about the types of each column.
* Pandas uses NumPy dtypes for numeric columns, including `int64`.
* There is also a type we haven't seen before, `object`, which is used for columns that have data that doesn't fit into any other dtypes. This is almost always used for columns containing string values.
* When we import data, pandas will attempt to guess the correct dtype for each column.
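
The dtype guessing can be sketched without a file on disk by reading CSV text from memory. The CSV content and column names below are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "company,rank,revenues\nCompany A,1,100.5\nCompany B,2,90.0\n"
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# pandas infers a dtype for each column from the values it reads
print(df.dtypes)
```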

In [None]:
f500.dtypes

* If we wanted an overview of all the dtypes used in our dataframe, along with its shape and other information, we could use the DataFrame.info() method. 
* Note that DataFrame.info() prints the information, rather than returning it, so we can't assign it to a variable.
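
A small sketch of this behavior on a hypothetical two-row dataframe. If the summary text itself is needed, one option is the `buf` parameter of `DataFrame.info()`, which writes the output to a buffer instead of printing it.

```python
import io

import pandas as pd

# Hypothetical dataframe, used only for illustration
df = pd.DataFrame({"rank": [1, 2], "country": ["USA", "China"]})

result = df.info()      # prints the summary
print(result is None)   # nothing is returned

# Write the summary into an in-memory buffer to keep it as text
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()
```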

In [None]:
f500.info()

# 5. Selecting a Column From a DataFrame by Label

* Because our axes in pandas have labels, we can select data using those labels — unlike in NumPy, where we needed to know the exact index location. 
* To do this, we can use the DataFrame.loc[] attribute. The syntax for DataFrame.loc[] is:

`df.loc[row_label, column_label]`
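
As an illustration of this syntax, here is a minimal sketch on a hypothetical dataframe. The labels are placeholders, not rows from the f500 data.

```python
import pandas as pd

# Hypothetical dataframe with string labels on both axes
df = pd.DataFrame(
    {"rank": [1, 2], "country": ["USA", "China"]},
    index=["Company A", "Company B"],
)

single_value = df.loc["Company B", "country"]  # one row label, one column label
whole_column = df.loc[:, "rank"]               # ":" selects every row

print(single_value)
print(whole_column.tolist())
```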

## TODO:
* Select the industry column. Assign the result to the variable name industries.
* Use Python's type() function to assign the type of industries to industries_type.

In [None]:
industries = f500.loc[:, 'industry']
industries_type = type(industries)

# 6. Introduction to Series

**Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series. Anytime you see a 2D pandas object, it will be a dataframe.**

* In fact, you can think of a dataframe as a collection of series objects, which is similar to how pandas stores the data behind the scenes.
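
The "collection of series" idea can be sketched by building a dataframe from two series and then pulling one back out. The names and values are hypothetical.

```python
import pandas as pd

# Two hypothetical series that share the same index labels
rank = pd.Series([1, 2], index=["Company A", "Company B"])
country = pd.Series(["USA", "China"], index=["Company A", "Company B"])

# Build a dataframe from a dict of series
df = pd.DataFrame({"rank": rank, "country": country})

# Selecting one column gives back a 1D series
print(type(df["rank"]).__name__)
print(df.ndim, df["rank"].ndim)
```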

# 7. Selecting Columns From a DataFrame by Label Continued

* We can use a list of labels to select specific columns.


<block><pre>

A summary of the techniques we've learned so far is below:

Select by Label      Explicit Syntax                Common Shorthand
Single column        df.loc[:,"col1"]               df["col1"]
List of columns      df.loc[:,["col1", "col7"]]     df[["col1", "col7"]]
Slice of columns     df.loc[:,"col1":"col4"]        -
<block></pre>
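
A quick sketch that checks the explicit and shorthand forms return the same data. It uses a hypothetical dataframe whose column names mirror the placeholders in the table.

```python
import pandas as pd

# Hypothetical dataframe with placeholder column names
df = pd.DataFrame(
    {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6], "col4": [7, 8]},
    index=["row1", "row2"],
)

same_single = df.loc[:, "col1"].equals(df["col1"])
same_list = df.loc[:, ["col1", "col3"]].equals(df[["col1", "col3"]])
col_slice = df.loc[:, "col2":"col4"]  # label slices include the end label

print(same_single, same_list)
print(col_slice.columns.tolist())
```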

## TODO:
* Select the country column. Assign the result to the variable name countries.
* In order, select the revenues and years_on_global_500_list columns. Assign the result to the variable name revenues_years.
* In order, select all columns from ceo up to and including sector. Assign the result to the variable name ceo_to_sector.

In [8]:
countries=f500['country']
revenues_years=f500[['revenues','years_on_global_500_list']]
ceo_to_sector=f500.loc[:,'ceo':'sector']

In [None]:
countries

In [None]:
revenues_years

In [None]:
ceo_to_sector

# 8. Selecting Rows From a DataFrame by Label

`df.loc[row_label, column_label]`
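
The same syntax applied to rows, as a minimal sketch on a hypothetical dataframe. Unlike ordinary Python slices, a label slice with `loc` includes its end label.

```python
import pandas as pd

# Hypothetical dataframe; the labels are placeholders
df = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4],
        "sector": ["Retail", "Energy", "Energy", "Motor"],
        "country": ["USA", "China", "China", "Japan"],
    },
    index=["Company A", "Company B", "Company C", "Company D"],
)

one_row = df.loc["Company B"]                    # single label: a series
some_rows = df.loc[["Company D", "Company A"]]   # list: rows in the order given
row_slice = df.loc["Company B":"Company C", "sector":"country"]

print(type(one_row).__name__)
print(some_rows.index.tolist())
print(row_slice)
```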

## TODO:
By selecting data from f500:
* Create a new variable toyota, with:
  * Just the row with index Toyota Motor.
  * All columns.
* Create a new variable, drink_companies, with:
  * Rows with indices Anheuser-Busch InBev, Coca-Cola, and Heineken Holding, in that order.
  * All columns.
* Create a new variable, middle_companies with:
  * All rows with indices from Tata Motors to Nationwide, inclusive.
  * All columns from rank to country, inclusive.

In [12]:
# select a single row
toyota = f500.loc['Toyota Motor', :]

# select a list of rows
drink_companies = f500.loc[['Anheuser-Busch InBev', 'Coca-Cola', 'Heineken Holding']]

# select a slice of rows and a slice of columns
middle_companies = f500.loc['Tata Motors':'Nationwide', 'rank':'country']

In [None]:
toyota

In [None]:
drink_companies

In [None]:
middle_companies

# 9. Series vs Dataframes

<block><pre>
                        Column                           Row
 1. Select single       df['col']                        df.loc['row']
 2. Select list of      df[['col1','col2','col3']]       df.loc[['row1','row2','row3']]
 3. Select slice of     df.loc[:,'col1':'col5']          df['row1':'row3']

<block></pre>
* Selecting a single column or row returns a Series, while selecting more than one column or row returns a DataFrame.

# 10. Value Counts Method

* Because series and dataframes are two distinct objects, they have their own unique methods.

* The `Series.value_counts()` method returns each unique non-null value in a column along with its count, sorted from the most frequent value to the least frequent.
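
A minimal sketch of `Series.value_counts()` on a hypothetical series. The values are made up, and one missing value is included to show that it is left out of the counts.

```python
import pandas as pd

# Hypothetical country values, including one missing entry
countries = pd.Series(["USA", "China", "USA", None, "Japan", "USA", "China"])

country_counts = countries.value_counts()

# Most frequent value first; the missing entry is not counted
print(country_counts)
```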

## TODO:
* Select the country column in the f500 dataframe. Assign it to a variable named countries.
* Use the Series.value_counts() method to return the value counts for countries. Assign the results to country_counts.

In [16]:
countries=f500['country']
country_counts=countries.value_counts()

In [None]:
country_counts

# 11. Selecting Items from a Series by Label

<block><pre>

Select by Label                 Explicit Syntax               Shorthand Convention
Single item from series         s.loc["item8"]                s["item8"]
List of items from series       s.loc[["item1","item7"]]      s[["item1","item7"]]
Slice of items from series      s.loc["item2":"item4"]        s["item2":"item4"]

<block></pre>
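
The three selections can be sketched on a hypothetical series whose index labels mirror the placeholders in the table.

```python
import pandas as pd

# Hypothetical series with placeholder labels
s = pd.Series([10, 20, 30, 40], index=["item1", "item2", "item3", "item4"])

single = s["item3"]               # one label: a single value
several = s[["item4", "item1"]]   # list of labels: a series, in the order given
sliced = s["item2":"item4"]       # label slice: includes the end label

print(single)
print(several.tolist())
print(sliced.tolist())
```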

## TODO:
From the pandas series country_counts:
* Select the item at index label India. Assign the result to the variable name india.
* In order, select the items with index labels USA, Canada, and Mexico. Assign the result to the variable name north_america.

In [18]:
india=country_counts['India']

In [19]:
north_america=country_counts[['USA','Canada','Mexico']]

In [None]:
india

In [None]:
north_america

# 12. Summary Challenge

<block><pre>
Select by Label                     Explicit Syntax                 Shorthand Convention
Single column from dataframe        df.loc[:,"col1"]                df["col1"]
List of columns from dataframe      df.loc[:,["col1","col7"]]       df[["col1","col7"]]
Slice of columns from dataframe     df.loc[:,"col1":"col4"]         -
Single row from dataframe           df.loc["row4"]                  -
List of rows from dataframe         df.loc[["row1", "row8"]]        -
Slice of rows from dataframe        df.loc["row3":"row5"]           df["row3":"row5"]
Single item from series             s.loc["item8"]                  s["item8"]
List of items from series           s.loc[["item1","item7"]]        s[["item1","item7"]]
Slice of items from series          s.loc["item2":"item4"]          s["item2":"item4"]


<block></pre>

## TODO:
By selecting data from f500:

* Create a new variable big_movers, with:
  * Rows with indices Aviva, HP, JD.com, and BHP Billiton, in that order.
  * The rank and previous_rank columns, in that order.
* Create a new variable, bottom_companies with:
   * All rows with indices from National Grid to AutoNation, inclusive.
   * The rank, sector, and country columns.

In [22]:
big_movers=f500.loc[['Aviva','HP','JD.com','BHP Billiton'],['rank','previous_rank']]
bottom_companies=f500.loc['National Grid':'AutoNation',['rank','sector','country']]

In [None]:
big_movers

In [None]:
bottom_companies