**Estimating the Hubble Constant Using Type Ia Supernovae**

One of the most important discoveries in modern astronomy is that the universe is expanding. This means that, on large scales, galaxies move away from one another over time. The rate of this expansion is described by a number called the Hubble Constant, which quantifies how fast a galaxy appears to be moving away based on its distance.
In simple terms, galaxies that are farther from Earth tend to be moving away faster. This relationship, known as Hubble’s Law, provides one of the key pieces of evidence for the expansion of the universe. While the basic idea is straightforward, measuring the exact value of the Hubble Constant has proven surprisingly difficult. Different measurement methods sometimes yield slightly different results, making this an active area of scientific research.
The goal of this project is to estimate the Hubble Constant using real astronomical data and to assess how well basic data analysis techniques reproduce this well-known relationship. Rather than trying to resolve ongoing debates in cosmology, this project focuses on understanding how choices in data cleaning, visualization, and modeling affect scientific conclusions when working with real observational data.

To do this, the project uses observations of Type Ia supernovae, which are powerful stellar explosions that can be used to measure cosmic distances. Because these supernovae have very similar intrinsic luminosities, astronomers can determine their distances by comparing their apparent brightnesses from Earth. By combining distance information with measurements of how much the universe has expanded since the light was emitted, it is possible to study the relationship between distance and recession speed.
Because Hubble’s Law works best for relatively nearby objects, this analysis focuses on supernovae in the nearby universe. This allows simple linear models to be used while also clarifying where they begin to break down.


**Dataset Description**

The data used in this project come from publicly available collections of Type Ia supernova observations, such as the Pantheon+ dataset. Each observation includes information on how much the light from a supernova has been stretched by the expansion of the universe, as well as an estimate of the supernova's distance.
Several key pieces of information are used in the analysis:
**Redshift**: A measure of how much the light from a distant object has been stretched as the universe expands. Larger values generally correspond to greater distances.


**Recession velocity**: An estimate of how fast a galaxy is moving away from Earth, inferred from redshift measurements.


**Distance**: The estimated distance to the supernova, measured in megaparsecs (a standard astronomical distance unit).


**Uncertainty estimates**: Measurements are not perfect; each distance and redshift value carries some uncertainty, which limits how confidently the fitted models can be interpreted.
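
The quantities listed above can be related with two short conversions. The sketch below is illustrative only: the function names are hypothetical, the velocity uses the low-redshift approximation v ≈ cz, and the second function assumes the distance is supplied as a distance modulus, which the description above does not state.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # exact by definition of the metre


def recession_velocity_km_s(redshift):
    """Low-redshift approximation v ~ c * z (only reasonable for z << 1)."""
    return SPEED_OF_LIGHT_KM_S * redshift


def distance_mpc_from_modulus(distance_modulus):
    """Invert mu = 5 * log10(d / 10 pc), returning d in megaparsecs.

    Assumption: the catalogue provides a distance modulus rather than
    a distance in Mpc directly.
    """
    return 10 ** ((distance_modulus - 25.0) / 5.0)


# Example values chosen for illustration, not taken from any catalogue.
example_velocity = recession_velocity_km_s(0.01)
example_distance = distance_mpc_from_modulus(35.0)
print(f"v = {example_velocity:.1f} km/s, d = {example_distance:.1f} Mpc")
```

The approximation v ≈ cz degrades as redshift grows, which is one more reason to restrict the analysis to nearby objects.
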


The dataset contains many supernovae spanning a wide range of distances. At very large distances, however, the relationship between distance and recession speed becomes more complex because the expansion rate of the universe has changed over time. To keep the analysis simple and physically meaningful, this project focuses on supernovae with relatively small redshift values, representing the nearby universe.
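
A minimal sketch of this selection step is shown below. The record layout, the `redshift` key, and the cutoff of 0.1 are all assumptions made for illustration; the project's actual threshold is not stated here.

```python
def select_nearby(records, z_max=0.1):
    """Keep only records whose redshift is at or below z_max.

    z_max=0.1 is a hypothetical cutoff used for illustration.
    """
    return [row for row in records if row["redshift"] <= z_max]


# Made-up records, not real observations.
records = [
    {"name": "sn_a", "redshift": 0.02},
    {"name": "sn_b", "redshift": 0.08},
    {"name": "sn_c", "redshift": 0.45},
]

nearby = select_nearby(records)
print([row["name"] for row in nearby])
```

Moving the cutoff changes which points enter the fit, so it is worth checking how the estimated slope responds to a few different choices.
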


**Justification of Approach**

The primary analytical approach employed in this project is linear regression, a standard method for identifying and describing relationships between two variables. Linear regression is well-suited for this problem because Hubble’s Law predicts a straight-line relationship between distance and recession speed for nearby galaxies.
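
Written out, the relationship and the fitted model can be sketched as follows. The symbols are notation introduced here for illustration: $v$ is recession velocity in km/s, $d$ is distance in Mpc, $H_0$ is the Hubble Constant, $b$ is the fitted intercept, and $\varepsilon_i$ is the scatter of the $i$-th supernova about the line.

$$
v = H_0\, d, \qquad v_i = H_0\, d_i + b + \varepsilon_i
$$

The first expression is Hubble’s Law itself; the second is the regression form, in which the slope plays the role of $H_0$.
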
Before fitting any models, the data were visualized using scatterplots to understand the overall structure of the relationship. These plots showed a clear trend: supernovae at greater distances tend to have higher recession velocities. However, the scatter also increases at larger distances, suggesting that uncertainty and additional physical effects become more important.
To assess whether a more complex model would improve fit, a quadratic model was also tested. This model allows for gentle curvature rather than a straight line. Comparing the linear and quadratic models helps determine whether added complexity meaningfully improves the results or simply fits noise in the data.
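
One way to check whether extra complexity is only fitting noise is to score each model on data it was not trained on. The sketch below uses k-fold cross-validation, which is a technique swapped in here for illustration and not necessarily what this project did. The function name is hypothetical and the data are synthetic.

```python
import numpy as np


def cv_rmse(x, y, degree, n_folds=5, seed=0):
    """Mean held-out RMSE of a polynomial fit of the given degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.random.default_rng(seed).permutation(x.size)
    fold_errors = []
    for test_idx in np.array_split(order, n_folds):
        train_mask = np.ones(x.size, dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds only, then score on the held-out fold.
        coeffs = np.polyfit(x[train_mask], y[train_mask], deg=degree)
        residuals = y[test_idx] - np.polyval(coeffs, x[test_idx])
        fold_errors.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(fold_errors))


# Synthetic illustration only: a straight line plus Gaussian noise.
rng = np.random.default_rng(42)
distance_mpc = np.linspace(20.0, 400.0, 60)
velocity_km_s = 70.0 * distance_mpc + rng.normal(0.0, 300.0, distance_mpc.size)

for degree in (1, 2):
    score = cv_rmse(distance_mpc, velocity_km_s, degree)
    print(f"degree {degree}: held-out RMSE = {score:.1f} km/s")
```

If the quadratic term does not lower the held-out error by a meaningful amount, the simpler line is the safer description.
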
Examining differences between observed values and model predictions (residuals) is essential in this analysis. Residual plots help reveal patterns that might indicate when a model’s assumptions are no longer valid, making them a valuable diagnostic tool beyond simple summary statistics.
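
As a rough numerical companion to a residual plot, the residuals can be averaged within distance bins; a systematic pattern across bins hints that the straight line is missing something. The sketch below is illustrative: the function name is hypothetical, and the synthetic data have curvature deliberately added so that the pattern is visible.

```python
import numpy as np


def binned_mean_residuals(x, y, n_bins=4):
    """Fit a straight line, then average its residuals in bins ordered by x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, deg=1)
    residuals = y - np.polyval(coeffs, x)
    order = np.argsort(x)
    return [float(chunk.mean()) for chunk in np.array_split(residuals[order], n_bins)]


# Synthetic illustration only: a line with a small quadratic term injected.
distance_mpc = np.linspace(20.0, 400.0, 40)
velocity_km_s = 70.0 * distance_mpc + 0.02 * distance_mpc ** 2

bin_means = binned_mean_residuals(distance_mpc, velocity_km_s)
print([round(value, 1) for value in bin_means])
```

Here the outer bins sit above the line and the inner bins below it, the signature of unmodeled curvature. Residuals that scatter around zero in every bin would instead support the linear model.
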


**Mining Methods and Implementation**

The analysis was conducted in Python using widely used data science libraries. Data were cleaned and organized using tools designed for tabular data, and visualizations were created to illustrate trends and patterns in the dataset clearly.
Linear regression was applied to the subset of nearby supernovae to estimate the Hubble Constant. In this context, the slope of the best-fit line represents the estimated expansion rate of the universe. The intercept reflects the influence of local motions and measurement uncertainty rather than a physically meaningful value.
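
A minimal sketch of this fitting step is shown below, assuming NumPy is among the libraries used. The function name is hypothetical, and the data are synthetic, generated with an arbitrary slope of 70 km/s/Mpc purely so the example runs; they are not the project's data or results.

```python
import numpy as np


def fit_hubble_line(distance_mpc, velocity_km_s):
    """Least-squares line v = H0 * d + b; returns (H0, b)."""
    x = np.asarray(distance_mpc, dtype=float)
    y = np.asarray(velocity_km_s, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest power first
    return float(slope), float(intercept)


# Synthetic illustration only (arbitrary slope and noise level).
rng = np.random.default_rng(0)
distance_mpc = np.linspace(20.0, 400.0, 50)
velocity_km_s = 70.0 * distance_mpc + rng.normal(0.0, 300.0, distance_mpc.size)

h0_estimate, intercept = fit_hubble_line(distance_mpc, velocity_km_s)
print(f"slope: {h0_estimate:.1f} km/s/Mpc, intercept: {intercept:.0f} km/s")
```

With velocity in km/s and distance in Mpc, the slope comes out directly in km/s/Mpc, the conventional unit for the Hubble Constant.
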
To assess the sensitivity of the results to model choice, a quadratic regression model was also fitted to the same data. While this model slightly improves some numerical error measures, it is harder to interpret physically.
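
The comparison can be sketched as below, using in-sample root-mean-square error as one possible error measure (an assumption; the specific measures used are not named here). The function name is hypothetical and the data are synthetic.

```python
import numpy as np


def in_sample_rmse(x, y, degree):
    """Root-mean-square residual of a least-squares polynomial fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, deg=degree)
    return float(np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2)))


# Synthetic illustration only: a straight line plus Gaussian noise.
rng = np.random.default_rng(1)
distance_mpc = np.linspace(20.0, 400.0, 60)
velocity_km_s = 70.0 * distance_mpc + rng.normal(0.0, 300.0, distance_mpc.size)

rmse_linear = in_sample_rmse(distance_mpc, velocity_km_s, degree=1)
rmse_quadratic = in_sample_rmse(distance_mpc, velocity_km_s, degree=2)
print(f"linear: {rmse_linear:.1f} km/s, quadratic: {rmse_quadratic:.1f} km/s")
```

Because the linear model is a special case of the quadratic one, the quadratic fit can never have a larger in-sample error. A small improvement is therefore expected even when the true relationship is a straight line, and it is not by itself evidence for curvature.
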
Throughout the analysis, care was taken to keep the workflow clear and reproducible. Plots include labeled axes and units, variables are named descriptively, and intermediate results are examined to ensure transparency.


**Results**

Applying linear regression to the nearby supernova data produces an estimated Hubble Constant of approximately 65 km/s/Mpc. The model explains most of the variation in the data, indicating that a straight-line relationship provides a good description of the nearby universe.
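
The share of variation explained by a model is commonly summarized by the coefficient of determination, R². The sketch below assumes that is the measure meant here; the function name is hypothetical and the numbers are toy values, not the project's results.

```python
import numpy as np


def r_squared(y_observed, y_predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_residual = np.sum((y_observed - y_predicted) ** 2)
    ss_total = np.sum((y_observed - y_observed.mean()) ** 2)
    return float(1.0 - ss_residual / ss_total)


# Toy values for illustration only.
observed = [700.0, 1400.0, 2100.0, 2800.0]
perfect_prediction = [700.0, 1400.0, 2100.0, 2800.0]
mean_only_prediction = [1750.0, 1750.0, 1750.0, 1750.0]

print(r_squared(observed, perfect_prediction))
print(r_squared(observed, mean_only_prediction))
```

A value near 1 means the line accounts for most of the spread in velocities, while a value near 0 means it does no better than predicting the mean velocity for every supernova.
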
Residual analysis indicates that the model performs well overall, although minor deviations are observed at the largest distances included in the study. These deviations likely reflect increasing uncertainty and the gradual breakdown of the simple linear approximation rather than a problem with the fit itself.

The quadratic model provides only a slight improvement in numerical error measures. Given the added complexity and lack of clear physical interpretation, the linear model remains the preferred choice for this analysis.


**Discussion**

This project demonstrates that simple data mining techniques can successfully reproduce one of the most important relationships in cosmology. The estimated value of the Hubble Constant is broadly in line with commonly cited measurements, though it is somewhat lower than most published values, which typically fall between about 67 and 74 km/s/Mpc. This highlights how sensitive such estimates can be to data selection and modeling decisions.

An important takeaway is that more complex models are not always better. While they may slightly improve statistical metrics, they can also reduce interpretability and obscure the physical meaning of the results. Careful visualization and residual analysis are essential for understanding when a simple model is sufficient and when it begins to break down.

From a broader perspective, this project demonstrates how data mining methods can be used not only for prediction but also for understanding scientific phenomena. Combining statistical tools with domain knowledge allows for more meaningful and responsible interpretation of real-world data.