
# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter, see the excellent tutorial in
its [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not open automatically, open http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
The code cell contains two lines of Python code.

![Markdown and code cells in the "test.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. After the last line of code, add a line that multiplies the elements by 2, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output shown in :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing, two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This confuses Git and makes
reviewing contributions very difficult.
Fortunately, there is an alternative: native editing in the markdown format.
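
To make the auxiliary data concrete, here is a minimal sketch of what an ipynb file holds on disk, built with Python's standard `json` module only. The cell contents and the helper names `make_notebook` and `sources` are hypothetical placeholders, not the book's code. The sketch shows that running a cell changes fields such as `execution_count` and `outputs`, so the file differs even though the author-written source is identical.

```python
import json


def make_notebook(execution_count, outputs):
    """Build a minimal nbformat-4 notebook as a plain dict (hypothetical content)."""
    return {
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": ["# This Is a Title\n", "This is text."],
            },
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": outputs,
                "source": ["x = [1, 2, 3]\n", "print(x)"],
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def sources(nb):
    """Return only the author-written content of each cell."""
    return ["".join(cell["source"]) for cell in nb["cells"]]


# The same notebook before and after running the code cell once.
fresh = make_notebook(execution_count=None, outputs=[])
executed = make_notebook(
    execution_count=1,
    outputs=[{"output_type": "stream", "name": "stdout", "text": ["[1, 2, 3]\n"]}],
)

same_content = sources(fresh) == sources(executed)
same_file = json.dumps(fresh, indent=1) == json.dumps(executed, indent=1)
print(f"content identical: {same_content}, file identical: {same_file}")
```

A line-based diff of the two serialized notebooks would therefore report changes that a reviewer has to ignore, which is the noise that the markdown format avoids.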

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.
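
If you prefer to script the configuration step, the following sketch appends the line only when it is not already present, so it can be run repeatedly without duplicating the setting. The helper name `ensure_line` is hypothetical, and the demonstration writes to a temporary file rather than to your real configuration.

```python
import tempfile
from pathlib import Path

LINE = "c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'"


def ensure_line(config_path, line=LINE):
    """Append `line` to the file unless it is already there.

    Returns True if the line was appended, False if it was already present.
    """
    path = Path(config_path)
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False
    with path.open("a") as f:
        # Start on a fresh line if the file does not end with a newline.
        if text and not text.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True


# Demonstration on a throwaway file (assumption: not your real config).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "jupyter_notebook_config.py"
    first = ensure_line(demo_path)
    second = ensure_line(demo_path)
    print(f"appended on first call: {first}, on second call: {second}")
```

To apply it for real, pass the path of your own configuration file, for example `Path.home() / ".jupyter" / "jupyter_notebook_config.py"` on Linux or macOS.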

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access them through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

Here, `myserver` is the address of the remote server.
Then we can use http://localhost:8888 to access the Jupyter notebooks running on the remote server `myserver`. We will detail how to run Jupyter notebooks on AWS instances
later in this appendix.
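
If the page does not load, it can help to check whether anything is listening on the forwarded port at all. The sketch below uses only Python's standard `socket` module; the helper name `port_is_open` is hypothetical, and the default port 8888 simply matches the forwarding command above.

```python
import socket


def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


print("Something is listening on localhost:8888:", port_is_open())
```

A result of `True` only tells you that the port accepts connections. It does not confirm that the listener is the remote Jupyter server rather than a local one.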

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```
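
When installing the plugin is not an option, a rough alternative is to time a block of code by hand. The sketch below uses the standard `time.perf_counter` clock; the context manager `timer` is a hypothetical helper, not part of Jupyter or the plugin.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.4f} sec")


# Example workload: any code placed inside the block is timed.
with timer("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

Inside a notebook, IPython's built-in `%%time` cell magic gives a similar one-off measurement without any helper code.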

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


In [6]:
import pandas as pd
import numpy as np
from scipy import stats

# Assuming 'df' is the DataFrame containing all the necessary property data

# Example DataFrame
data = {'property_age': [5, 15, 25]}
df = pd.DataFrame(data)


In [7]:
# Task 1: Create a property_age_category column
def categorize_property_age(age):
    if 0 <= age <= 1:
        return 'New'
    elif 1 < age <= 5:
        return 'Less than 5 years'
    elif 5 < age <= 10:
        return '5 to 10 years'
    elif 10 < age <= 20:
        return '10 to 20 years'
    else:
        return 'More than 20 years'
df['property_age_category'] = df['property_age'].apply(categorize_property_age)


In [9]:
# Task 2: Find the most frequent property_age_category
most_frequent_category = df['property_age_category'].mode()[0]
print(f"The most frequent property age category is: {most_frequent_category}")

The most frequent property age category is: 10 to 20 years


In [14]:
# Task 3: Create time_category based on the hour of the day
def categorize_time_of_day(hour):
    if pd.isna(hour):  # Handle missing values
        return 'Unknown'
    if 0 <= hour < 6:
        return 'Midnight'
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    else:
        return 'Evening'






In [20]:
# Task 4: Find the most frequent time_category
# Apply the function to 'request_date' to create 'time_category'
df['time_category'] = df['request_date'].apply(
    lambda x: categorize_time_of_day(x.hour) if pd.notna(x) else 'Unknown'
)

# Find the most frequent time_category
most_frequent_time_category = df['time_category'].mode()[0]
print(f"The most frequent time category is: {most_frequent_time_category}")

KeyError: 'request_date'

In [22]:
# Task 4: Find the most frequent time_category
# Generate a random 'request_time' column for demonstration
import random
df['request_time'] = [random.randint(0, 23) for _ in range(len(df))]
# Apply the function to 'request_time' to create 'time_category'
df['time_category'] = df['request_time'].apply(categorize_time_of_day)

# Find the most frequent time_category
most_frequent_time_category = df['time_category'].mode()[0]
print(f"The most frequent time category is: {most_frequent_time_category}")

The most frequent time category is: Midnight


In [25]:

# Task 5: Calculate the percentage of properties available for lease under 'Anyone'
# Simulate 'lease_type' column with random values
import random
lease_types = ['Anyone', 'Family', 'Bachelors']
df['lease_type'] = random.choices(lease_types, k=len(df))

total_properties = len(df)
properties_anyone = len(df[df['lease_type'] == 'Anyone'])
percentage_anyone = (properties_anyone / total_properties) * 100
print(f"Percentage of properties available for lease under 'Anyone': {percentage_anyone:.2f}%")


Percentage of properties available for lease under 'Anyone': 66.67%


In [30]:
# Task 6: Identify the top localities with the highest average rent and find interaction counts
# Inspect the columns available in the dataframe first
print(df.columns)

# If 'rent' is missing, you can replace it with the correct column name
# For example, if the rent column is named 'price', use that instead of 'rent'

# Assuming 'rent' is the correct column and it exists
if 'rent' in df.columns:
    top_localities = df.groupby('locality')['rent'].mean().nlargest(5).index
    interaction_counts_by_locality = df[df['locality'].isin(top_localities)].groupby('locality')['total_interactions'].max()

    print("Interaction counts in top localities with the highest average rent:")
    print(interaction_counts_by_locality)
else:
    print("The 'rent' column is missing. Please check your dataframe.")


Index(['property_age', 'property_age_category', 'request_time',
       'time_category', 'lease_type', 'locality'],
      dtype='object')
The 'rent' column is missing. Please check your dataframe.


In [32]:
# Task 7: Create a photo_count feature
import ast

def count_photos(photo_urls):
    if pd.isna(photo_urls):
        return 0
    try:
        # ast.literal_eval parses list literals without executing arbitrary code,
        # unlike eval
        return len(ast.literal_eval(photo_urls))
    except (ValueError, SyntaxError, TypeError) as e:
        print(f"Error parsing photo_urls: {e}")
        return 0

# Check if the 'photo_urls' column exists before applying the function
if 'photo_urls' in df.columns:
    df['photo_count'] = df['photo_urls'].apply(count_photos)
else:
    print("'photo_urls' column is missing.")



'photo_urls' column is missing.


In [34]:
# Task 8: Find the photo_count of the property with the highest interactions
# Check if 'total_interactions' column exists
if 'total_interactions' in df.columns:
    # Find the property with the highest interactions
    max_interactions_property = df.loc[df['total_interactions'].idxmax()]
    # Get the photo count of that property
    photo_count_max_interactions = max_interactions_property['photo_count']
    print(f"Photo count of the property with the highest interactions: {photo_count_max_interactions}")
else:
    print("'total_interactions' column is missing.")


'total_interactions' column is missing.


In [38]:
# Task 9: Perform hypothesis test - Compare average interactions for properties with and without a gym
from scipy import stats

# Check if 'gym' column exists
if 'gym' in df.columns:
    # Split data into with and without gym
    with_gym = df[df['gym'] == 1]['total_interactions']
    without_gym = df[df['gym'] == 0]['total_interactions']

    # Perform t-test
    t_stat, p_value = stats.ttest_ind(with_gym, without_gym)

    # Check the result of the hypothesis test
    if p_value < 0.05:
        print("Reject the null hypothesis: Properties with a gym have significantly different average interactions.")
    else:
        print("Fail to reject the null hypothesis: No significant difference in average interactions between properties with and without a gym.")
else:
    print("'gym' column is missing.")


'gym' column is missing.


In [39]:
# Task 10: Percentage of properties available for lease under 'Anyone'
# Already covered in Task 5


In [41]:
# Task 11: Identify the date with the most property activations
# Check if 'activation_date' column exists
if 'activation_date' in df.columns:
    # Convert 'activation_date' to datetime format if not already
    df['activation_date'] = pd.to_datetime(df['activation_date'], errors='coerce')

    # Find the date with the most property activations
    most_active_date = df['activation_date'].value_counts().idxmax()
    print(f"Most properties were activated on: {most_active_date}")
else:
    print("'activation_date' column is missing.")


'activation_date' column is missing.


In [43]:
# Task 12: Find the locality with the highest average rent
# Check if 'locality' and 'rent' columns exist
if 'locality' in df.columns and 'rent' in df.columns:
    # Find the locality with the highest average rent
    highest_rent_locality = df.groupby('locality')['rent'].mean().idxmax()
    print(f"The locality with the highest average rent is: {highest_rent_locality}")
else:
    print("'locality' or 'rent' column is missing.")


'locality' or 'rent' column is missing.


In [45]:
# Task 13: Identify the apartment type with the highest average interactions
# Check if 'type' and 'total_interactions' columns exist
if 'type' in df.columns and 'total_interactions' in df.columns:
    # Identify the apartment type with the highest average interactions
    apartment_interactions = df.groupby('type')['total_interactions'].mean()
    apartment_type_highest_interactions = apartment_interactions.idxmax()
    print(f"The apartment type with the highest average interactions is: {apartment_type_highest_interactions}")
else:
    print("'type' or 'total_interactions' column is missing.")


'type' or 'total_interactions' column is missing.


In [47]:
# Task 14: Find the amenity with the greatest impact on rent
# List of amenities to check
amenities = ['gym', 'lift', 'swimming_pool']

# Check if the amenity columns exist
missing_amenities = [amenity for amenity in amenities if amenity not in df.columns]

if missing_amenities:
    print(f"Missing amenity columns: {', '.join(missing_amenities)}")
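elif 'rent' not in df.columns:
    # The correlation below needs the 'rent' column as well
    print("'rent' column is missing.")
else:
    # Calculate the correlation between each amenity and rent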
    for amenity in amenities:
        correlation = df[amenity].corr(df['rent'])
        print(f"Correlation between {amenity} and rent: {correlation:.2f}")


Missing amenity columns: gym, lift, swimming_pool
