#**<font color='#0969DA'>Guided Lab 343.3.4 - Exploratory Data Analysis on JSON Data - Basic Insights from the Data</font>**
---

## **Lab Overview:**

This lab focuses on performing Exploratory Data Analysis (EDA) on a JSON dataset using Python and the Pandas library. The lab aims to guide you through the following key concepts:

1. **Data Type Inspection:** Understanding the importance of checking data types for potential mismatches and compatibility with Python methods. This is demonstrated using the `dtypes` attribute of Pandas DataFrames.

2. **Descriptive Statistics:** Calculating and interpreting basic statistical measures such as mean, standard deviation, minimum, maximum, and quartiles using the `describe()` method.

3. **Concise Summary:** Obtaining a comprehensive overview of the dataset, including column names, data types, memory usage, and non-null values, using the `info()` method.

4. **Data Selection:** Extracting specific records or subsets of the data using the `head()` and `tail()` methods and the `at` and `iat` accessors, enabling efficient exploration of large datasets.

5. **Data Shape and Size:** Determining the number of rows and columns using the `shape` attribute and exploring alternative methods like `axes` and `len` to access this information.

**Learning Outcomes:**

By the end of this lab, you should be able to:

* Confidently load and manipulate JSON data in Python using Pandas.
* Utilize various Pandas functions to perform basic EDA tasks.
* Interpret descriptive statistics and summaries to gain insights into data.
* Efficiently extract and analyze specific subsets of data.
* Understand the structure and dimensions of a dataset.

**Dataset:**

The lab utilizes a JSON dataset named ['cars.json'](https://drive.google.com/file/d/1CXAK8gbuLtc2NNOXVUgmja8fDg0TrNZm/view) as the primary data source for analysis and demonstration.


###**<font color='#0969DA'>How to check Data types in Pandas</font>**



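- In pandas, the **dtypes** attribute lists the data type of each column.
- Why check data types?
  - To spot mismatches between a column's content and its type, such as numbers stored as text.
  - To confirm compatibility with the Python and pandas methods you plan to apply.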
---
#**Begin**

The lab follows a step-by-step approach, starting with loading the JSON data into a Pandas DataFrame. It then proceeds with exploring the data's characteristics, calculating statistics, selecting specific records, and understanding the dataset's structure.

In [None]:
import pandas as pd
# Read JSON file
df_cars = pd.read_json('cars.json')
print(df_cars.dtypes)
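
To illustrate why a type mismatch matters, here is a minimal sketch on a small hand-made DataFrame. The records, the `Horsepower` column, and the `"?"` placeholder are assumptions made up for this illustration and are not taken from `cars.json`.

```python
import pandas as pd

# Hypothetical records: 'Horsepower' arrives as text, with one "?" placeholder
records = [
    {"Car": "A", "MPG": 18.0, "Horsepower": "130"},
    {"Car": "B", "MPG": 15.0, "Horsepower": "165"},
    {"Car": "C", "MPG": 16.0, "Horsepower": "?"},
]
df_demo = pd.DataFrame(records)
print(df_demo.dtypes)  # 'Horsepower' is not a numeric type at this point

# Convert the text to numbers; values that cannot be parsed become NaN
df_demo["Horsepower"] = pd.to_numeric(df_demo["Horsepower"], errors="coerce")
print(df_demo.dtypes)

# Numeric methods such as mean() now work and skip the NaN value
print(df_demo["Horsepower"].mean())
```

Checking `dtypes` before and after the conversion confirms whether the column is ready for numerical methods.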

##**<font color='#0969DA'>Determining Descriptive Statistics</font>**

- Pandas provides many statistical methods for DataFrames. The **describe()** method returns a summary of basic statistics for the numerical columns of a DataFrame.

For the full list of descriptive statistics methods, see:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

- Example: Consider the **cars.json** dataset

In [None]:
import pandas as pd
df_cars = pd.read_json('cars.json')
df_cars.describe()

In the above result, `describe()` returns a new DataFrame with one row per statistic: the count of non-null values, the mean, the standard deviation, the minimum, the quartiles (25%, 50%, 75%), and the maximum of each numerical column.
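
As a small sketch of how the `describe()` result can be read programmatically, the example below uses a hand-made DataFrame. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: one text column and two numerical columns
df_demo = pd.DataFrame({
    "Car": ["A", "B", "C", "D"],
    "MPG": [18.0, 15.0, 16.0, 17.0],
    "Cylinders": [8, 8, 6, 4],
})

# By default, only the numerical columns are summarized
summary = df_demo.describe()
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up by label
print(summary.loc["mean", "MPG"])

# include="all" adds the non-numerical columns to the summary as well
summary_all = df_demo.describe(include="all")
print(summary_all)
```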

---



##**<font color='#0969DA'>Determine Basic Concise summary</font>**

You can get a concise summary of a Pandas DataFrame with the **info()** method.

In other words, `info()` reports metadata about the DataFrame, including:

- Number of rows and the range of the index
- Total number of columns
- List of columns
- Count of non-null values in each column
- Data type of each column
- Count of columns of each data type
- Memory usage of the DataFrame

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

# **<font color='#0969DA'>DataFrame Count</font>**

**df.count():**
`count()` returns the number of non-NA values in each column. As a way to count rows it is a weaker choice because 1) it is slower and 2) it returns one value per column, so you need an extra step after calling `.count()` to get a single row count.

Be careful: if your dataset contains NA values, the result may be confusing, because `count()` skips them and a column's count can be lower than the number of rows.
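
The difference between `count()` and the row count can be sketched as follows. The DataFrame is a hypothetical example with one missing value, not data from `cars.json`.

```python
import pandas as pd

# Hypothetical data with one missing MPG value
df_demo = pd.DataFrame({
    "Car": ["A", "B", "C"],
    "MPG": [18.0, float("nan"), 16.0],
})

# One non-NA count per column
counts = df_demo.count()
print(counts)

# The number of rows is unaffected by missing values
print(len(df_demo))
```

Here the `MPG` count is one less than the number of rows because of the missing value.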

In [None]:
df_cars.info()

In the above result, the summary contains the index range, the number of columns, the column labels, the number of non-null values in each column, the column data types, and the memory usage.
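
`info()` prints its report rather than returning it. If you need the text itself, one option is to pass a buffer through the `buf` parameter, as in this minimal sketch. The DataFrame is a hypothetical example.

```python
import io

import pandas as pd

# Hypothetical data with one missing MPG value
df_demo = pd.DataFrame({
    "Car": ["A", "B", "C"],
    "MPG": [18.0, float("nan"), 16.0],
})

# Write the report into an in-memory text buffer instead of the screen
buffer = io.StringIO()
df_demo.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```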

---



##**<font color='#0969DA'>Select few records</font>**

The **head()** and **tail()** functions select the top and bottom rows of a Pandas DataFrame, respectively. They are useful for massive datasets, where it is not possible to view the entire dataset at once.

**Example: Consider the cars.json dataset**

With **head(2)**, only the first 2 rows of the DataFrame are displayed.

In [None]:
df_cars.head(2)



---



With **tail(2)**, only the last 2 rows of the DataFrame are displayed.

In [None]:
df_cars.tail(2)
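
A brief sketch of a few more `head()` and `tail()` behaviours, using a hypothetical ten-row DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with ten rows holding the values 0 to 9
df_demo = pd.DataFrame({"value": range(10)})

# Without an argument, head() and tail() return 5 rows
first_five = df_demo.head()

# The last 3 rows
last_three = df_demo.tail(3)

# A negative n returns everything except the last |n| rows
all_but_last_two = df_demo.head(-2)

print(first_five)
print(last_three)
print(all_but_last_two)
```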



---



##**<font color='#0969DA'>Select Specific records</font>**

The **at** and **iat** properties access a single element of the DataFrame: **at** by row and column label, **iat** by integer position.

Example: Using **at** property:

**Consider the cars.json dataset**



In [None]:
df_cars.at[157, 'MPG']


**DataFrame.iat:** Suppose we want to access a specific element of a very large DataFrame, but we do not know its column label or row label. We can still access the element by its row and column positions, using the **iat** property.

**Example: Using iat property:**
In this example, we access the element at row position 157 and column position 1. Positions are zero-based, so this is the 158th row and the 2nd column.

In [None]:
df_cars.iat[157, 1]
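
The difference between label-based and position-based access is easier to see when the row labels are not integers. The following is a minimal sketch on a hypothetical DataFrame; the row labels `car_a` to `car_c` are made up for this illustration.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels
df_demo = pd.DataFrame(
    {"MPG": [18.0, 15.0, 16.0], "Cylinders": [8, 8, 6]},
    index=["car_a", "car_b", "car_c"],
)

# at: row label and column label
by_label = df_demo.at["car_b", "MPG"]

# iat: zero-based row position and column position (same cell as above)
by_position = df_demo.iat[1, 0]

print(by_label, by_position)

# Both accessors can also assign a new value to a single cell
df_demo.at["car_c", "Cylinders"] = 4
print(df_demo)
```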



---



# **<font color='#0969DA'>DataFrame Shape</font>**
##**Find number of rows and columns**
The number of rows and columns of a DataFrame can be found with the **.shape** attribute of the Pandas DataFrame. It returns a tuple (rows, columns), which can be indexed to get only the row count or only the column count.


- **df.shape[0]** - To count rows

- **df.shape[1]** - To count columns

In [None]:

print(df_cars.shape) # Get the number of rows and columns
print(df_cars.shape[0]) # Get the number of rows only
print(df_cars.shape[1]) # Get the number of columns only
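
As a small sketch, the shape tuple can also be unpacked directly and compared with two related values, `len()` and the `size` attribute. The DataFrame is a hypothetical example.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
df_demo = pd.DataFrame({"MPG": [18.0, 15.0, 16.0], "Cylinders": [8, 8, 6]})

# Unpack the (rows, columns) tuple in one step
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows
print(len(df_demo))

# size is the total number of cells, rows multiplied by columns
print(df_demo.size)
```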

In [None]:
import pandas as pd
# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Age': [20, 21, 19], 'Marks': [85.10, 77.80, 91.54]}

student_df = pd.DataFrame(student_dict)

column_index = student_df.columns    # column labels as a pandas Index
print(column_index)
label = student_df.columns[0]  # label of the first column
print(label)
columns_as_list = student_df.columns.tolist()  # column labels as a plain Python list
print(columns_as_list)



---

#**<font color='#0969DA'>DataFrame Axes Length</font>**

**len(df.axes[0]):** Next up is our most verbose option: DataFrame Axes Length.

The **axes** attribute gives you the row axis, and you then count its length.

Let's break this one down. **df.axes** returns a list of the two axes, rows first and then columns. [0] pulls the first item (the row axis) from it. Finally, **len()** finds the length of that axis, that is, how many items it has, which is your row count.

Let's look through it step by step.

- Return both axes (rows/columns)

- Pull out the rows

- Count the length

In [None]:
df_cars.axes

In [None]:
df_cars.axes[0]

In [None]:
len(df_cars.axes[0])
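
To check that the axes-based count agrees with the other approaches, here is a minimal sketch on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
df_demo = pd.DataFrame({"MPG": [18.0, 15.0, 16.0], "Cylinders": [8, 8, 6]})

# axes holds the row axis first, then the column axis
row_axis, column_axis = df_demo.axes

# The row axis is the same object exposed as df_demo.index
print(row_axis.equals(df_demo.index))

# The axis lengths match the values in shape
print(len(row_axis), len(column_axis))
print(df_demo.shape)
```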

##**Submission**
- Submit your completed lab using the Start Assignment button on the assignment page in Canvas.
- Your submission can include:
  - If you are using a notebook, all tasks should be written and submitted in a single notebook file, for example: (**your_name_labname.ipynb**).
  - If you are using a Python script file, all tasks should be written and submitted in a single Python script file, for example: (**your_name_labname.py**).
- Add appropriate comments and any additional instructions if required.
