[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/IPML/blob/master/rpf_case_study_part-1.ipynb)


# Resale Price Forecasting Case Study - Part 1: Data Access and Exploration

The lecture introduced you to resale price forecasting, a task that supports decision-making in the leasing business. In this tutorial, we explore a dataset for resale price prediction. Over a series of tutorials, we will go through the different stages of a machine learning process, from initial data exploration to predictive modeling and post-hoc analysis. This notebook is the first step of that journey.


---

# Loading the Resale Price Prediction Dataset
You can find the Resale Price Prediction dataset for this notebook on our Moodle page and in our GitHub repository. The dataset covers laptops that have been leased and returned, and the task is to predict their resale prices. The resale price is influenced by various factors, including the original retail price, depreciation, release year, screen size, hard drive size, RAM size, weight, lease duration, and battery capacity. Our final goal is to use these features to forecast resale prices.

To start, however, we will explore the data to better understand its characteristics. Data exploration is also a good opportunity to get to know relevant Python libraries. In particular, we will use `pandas`, a widely used library for data analysis and manipulation in Python that provides powerful tools for handling structured data.



In [None]:
import pandas as pd  # Load pandas library

# Dataset URL
url = 'https://raw.githubusercontent.com/Humboldt-WI/IPML/main/data/resale_price_dataset.csv'

# Load data from URL
data = pd.read_csv(url)

Let's first take a look at the data. To that end, we use the method `head()`, which shows a preview of the first rows of the data (five rows by default).

In [None]:
# Display first few rows
data.head()
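Besides `head()`, a few other attributes and methods give a quick sense of a table's size and layout. The sketch below uses a tiny hand-made data frame; the name `toy`, the column names (`retail_price`, `ram_gb`, `resale_price`), and all values are made up for illustration and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "retail_price": [1200.0, 850.0, 1500.0, 990.0],
    "ram_gb": [16, 8, 32, 8],
    "resale_price": [540.0, 310.0, 720.0, 400.0],
})

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
column_names = list(toy.columns)  # column labels as a plain Python list
first_two = toy.head(2)           # head() accepts the number of rows to show
last_row = toy.tail(1)            # tail() previews the end of the table

print(n_rows, n_cols)
print(column_names)
print(first_two)
```

The same calls should work unchanged on the `data` frame loaded above.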


# Descriptive Statistics
The pandas library offers various functions to compute descriptive statistics, which help us summarize and understand the main characteristics of our dataset. Descriptive statistics provide insights into the distribution, central tendency, and variability of the data, allowing us to quickly grasp its overall structure. Furthermore, the pandas library offers functions to understand the data types and identify missing values in our dataset.

The relevant functions we will use for our first examination of the data are `pd.DataFrame.info()` and `pd.DataFrame.describe()`.

The `pd.DataFrame.info()` function reveals the high-level structure of our data table. Note that pandas uses the term *data frame* to refer to a table. The function reports the number of entries (i.e., rows), the data type of each column, and the number of non-null values per column; comparing the latter with the number of entries shows whether any values are missing. Understanding these details is crucial for further analysis of the data.

The `pd.DataFrame.describe()` function computes a suite of summary statistics for each numeric column in our dataset (non-numeric columns are skipped by default). Examples include the mean of a column, its standard deviation, and its quartiles.

But why are these functions essential? Both `info()` and `describe()` help us build a foundational understanding of our dataset. While `info()` gives us a structural overview, `describe()` takes us a step further into the statistical nature of each column. By noting aspects like the mean, standard deviation, and minimum/maximum values, we can spot potential outliers, get a feel for the scale of each column, and formulate hypotheses for further investigation.

Together, these methods serve as our initial checkpoint, ensuring that we're not only aware of the dataset's composition but also acquainted with its statistical properties. 

In [None]:
# Dataset structure and info
data.info()

In [None]:
# Summary statistics
data.describe()
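Missing values can also be counted directly, and the table returned by `describe()` can be queried like any other data frame. The following is a minimal sketch on made-up data; the frame `toy` and its column names are assumptions for illustration, not the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries; not the real dataset
toy = pd.DataFrame({
    "retail_price": [1200.0, 850.0, np.nan, 990.0],
    "battery_capacity": [56.0, np.nan, 72.0, 48.0],
    "resale_price": [540.0, 310.0, 720.0, 400.0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# describe() returns a data frame, so single statistics can be looked up by label
summary = toy.describe()
mean_resale = summary.loc["mean", "resale_price"]
count_retail = summary.loc["count", "retail_price"]  # count ignores missing values

print(missing_counts)
print(mean_resale, count_retail)
```

On the real data, `data.isna().sum()` would give the corresponding per-column counts.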

## Data Visualization

In this subsection, we take a graphical approach to understanding our Resale Price Prediction dataset. For this purpose, we first load two widely used Python libraries for data visualization: `Matplotlib` and `Seaborn`.

**Matplotlib**: A foundational plotting library, Matplotlib is the granddaddy of Python visualization tools. It offers immense flexibility and allows us to create a wide variety of charts and plots with fine-grained control over every aspect of the visuals. Whether it's histograms, scatter plots, or line charts, Matplotlib provides the functionalities to craft them all with detailed customizations.

**Seaborn**: Built on top of Matplotlib, Seaborn simplifies many visual tasks, making sophisticated plots accessible and understandable. It comes with built-in themes and color palettes that enhance the aesthetics of our visualizations. Seaborn is particularly adept at handling statistical graphics, making it easier to visualize complex datasets with just a few lines of code.

In [None]:
# Import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Histograms

Our first stop is the histogram, a type of plot that shows the frequency distribution of a single variable. By plotting histograms for all numeric columns in our dataset, we can visually grasp the distribution of data points and detect skewness or anomalies. This understanding matters because the shape of a feature's distribution can influence how certain machine learning models perform on our data.

In [None]:
# Plot histograms for all numeric columns
data.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()
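What we see in a histogram can be backed up with a number: pandas provides `skew()`, which returns the sample skewness of each numeric column. The sketch below uses synthetic data generated with NumPy; the frame `toy` and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# Synthetic columns for illustration only; not the real dataset
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=1000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.5, size=1000),
})

# Sample skewness per column: near 0 for symmetric data,
# clearly positive for a distribution with a long right tail
skewness = toy.skew()
print(skewness)
```

Columns with a large absolute skewness are the ones whose histograms deserve a closer look.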


## Correlation


Next, we consider the correlation between table columns. Recall that *correlation* is a measure of how strongly two numerical random variables (e.g., table columns) are linearly related. To compute the pairwise correlation between all numeric table columns, we can use the `corr()` function, which pandas provides; by default, it computes the Pearson correlation coefficient. Afterwards, we can visualize all the pairwise correlations as a heatmap for easy inspection. To achieve this, we will use the `heatmap` function from the Seaborn library.
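To make the definition concrete, the following sketch computes the Pearson correlation of two short series by hand and compares it with the pandas result. The values of `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson correlation: sum of products of deviations from the mean, divided by
# the root of the product of summed squared deviations
# (equivalent to covariance / (std_x * std_y))
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# pandas computes the same quantity (method="pearson" is the default)
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

Both values should agree up to floating-point rounding.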

In [None]:
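# Correlation matrix and heatmap
# numeric_only=True (pandas >= 1.5) restricts the computation to numeric columns;
# without it, pandas >= 2.0 raises an error if the table contains text columns
correlation_matrix = data.corr(numeric_only=True)
plt.figure(figsize=(10, 8))  # larger canvas so the annotations stay readable
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()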
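Since our final goal is to forecast the resale price, it can help to read the correlation matrix column by column and rank the features by their correlation with the target. The sketch below does this on made-up data; the frame `toy`, the target name `resale_price`, and the other column names are assumptions and may differ from the names in the actual dataset.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "retail_price": [800.0, 1000.0, 1200.0, 1400.0, 1600.0],
    "lease_duration": [36, 24, 36, 12, 24],
    "brand": ["A", "B", "A", "C", "B"],  # non-numeric column
    "resale_price": [250.0, 420.0, 400.0, 700.0, 650.0],
})

# numeric_only=True skips the text column (requires pandas >= 1.5)
corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, strongest absolute value first
target_corr = corr["resale_price"].drop("resale_price")
ranking = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top, since they are as informative as strongly positive ones. Keep in mind that correlation only captures linear relationships.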