## Deepika Gusain
## Module: Programming and Scripting
## Project: Iris Data Analysis  
This report is submitted as part of the final project requirement for the Programming and Scripting module. It presents an in-depth analysis of Fisher's Iris dataset using Python, covering research, code development, statistical analysis, visualization, and thorough documentation.

### 1. Introduction 
This project was designed to give practical exposure to real-world scripting and data analysis. The Iris dataset was chosen because it is foundational in machine learning and statistics. The report outlines the steps, challenges, and achievements from the beginning to the end of the project. A strong focus has been placed on modular programming, version control, research quality, and presentation clarity.

### 2. Research Investigations  
Extensive research was conducted on the Iris dataset using multiple reliable sources, including the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/iris), the academic literature (https://doi.org/10.1111/j.1469-1809.1936.tb02137.x), and the Python documentation for libraries such as Pandas (https://pandas.pydata.org/docs/), Seaborn (https://seaborn.pydata.org/), and Matplotlib (https://matplotlib.org/stable/contents.html).
Ronald Fisher introduced this dataset in 1936 to demonstrate linear discriminant analysis. The dataset includes 150 samples of iris flowers from three species. Each sample records four measurements: sepal length, sepal width, petal length, and petal width.  
In addition to data structure, background on the application of this dataset in classification problems and statistical modeling was investigated. Sources were cross-referenced to ensure accuracy, and all data references are fully cited. The research also covered visualization techniques and best practices in scientific reporting.
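
As a quick sanity check on the structure described above, the following minimal sketch counts rows, measurement columns, and samples per species in a CSV export. It assumes the column layout used by Seaborn's copy of the dataset (`sepal_length`, `sepal_width`, `petal_length`, `petal_width`, `species`). The helper name `describe_structure` is hypothetical and is not part of the project script.

```python
import csv
from collections import Counter


def describe_structure(csv_lines, label_column="species"):
    """Return (row_count, measurement_columns, samples_per_label).

    `csv_lines` is any iterable of CSV text lines, e.g. an open file.
    `label_column` is an assumed name for the species column.
    """
    reader = csv.DictReader(csv_lines)
    # Every column except the label is treated as a measurement.
    measurement_columns = [
        name for name in (reader.fieldnames or []) if name != label_column
    ]
    label_counts = Counter(row[label_column] for row in reader)
    return sum(label_counts.values()), measurement_columns, dict(label_counts)
```

Called on an open handle to `iris.csv` (for example inside `with open("iris.csv", newline="") as f:`), this would be expected to report 150 rows, four measurement columns, and 50 samples for each of the three species.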

### 3. Development and Implementation
The project was implemented entirely in Python. The script `analysis.py` was developed to perform the following:
- Load the Iris dataset using Seaborn
- Save the dataset as `iris.csv`
- Generate and save summary statistics to `variable_summary.txt`
- Plot and save histograms for each numerical feature
- Create and save scatter plots for all feature pairs

The code was developed with clarity and maintainability in mind. Each block of logic is accompanied by comments. Functions were reused where applicable, and filenames were dynamically generated to reduce redundancy. The code was tested and refined iteratively, with each improvement committed to the version control system.

### 4. Consistency and Version Control
GitHub was used to track every aspect of the development process. The repository (`pands-project`) was structured with a meaningful commit history. Each commit was atomic, addressing only one change or feature at a time. This helped maintain clarity and supported collaborative best practices.

The following structure was maintained:
- Regular commits
- Clear commit messages
- Use of `.gitignore`
- Inclusion of `README.md`, `requirements.txt`, and data/plots

Issues and milestones were optionally used to keep track of progress. This consistent approach ensured a traceable project workflow and reflects a professional development attitude.

### 5. Documentation and Readability
This report and the accompanying README file were written to high academic standards. Every section, from introduction to conclusion, is designed to be understandable even to a reader with a minimal technical background.

The README includes:
- Project purpose
- Dataset background
- Installation and execution instructions
- Detailed analysis overview
- Visualizations and insights
- References

The code is fully commented and formatted to improve readability. Variable names are descriptive, and plotting functions include titles, labels, and legends for clarity.

### 6. Data Analysis Techniques
Exploratory Data Analysis (EDA) techniques such as summary statistics, histograms, and scatter plots were used to gain insights into the Iris dataset. This provided a better understanding of distributions, relationships, and potential feature importance for classification tasks.
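
Summary statistics computed over the whole dataset can hide differences between species. A minimal sketch of a per-species breakdown, using only the standard library, might look like the following. The function name `summarise_by_group` is illustrative and not taken from the project script, and it assumes rows are dictionaries such as those produced by `csv.DictReader`.

```python
import statistics
from collections import defaultdict


def summarise_by_group(rows, value_column, group_column="species"):
    """Return {group: {"count", "mean", "stdev"}} for one numeric column."""
    grouped = defaultdict(list)
    for row in rows:
        # CSV readers yield strings, so convert each value to float.
        grouped[row[group_column]].append(float(row[value_column]))
    return {
        group: {
            "count": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else float("nan"),
        }
        for group, values in grouped.items()
    }
```

With Pandas available, `iris.groupby("species").describe()` should give a comparable, more complete table. The standard-library version is shown only to make the grouping step explicit.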

In particular, petal length and petal width showed strong separation between the species. Scatter plots using hue for species made this trend visually clear. Histograms were used to assess the shape of each measurement distribution, including how close it is to normal, and to spot potential outliers.
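
Outliers spotted by eye in a histogram can be cross-checked numerically. The sketch below applies one common convention, Tukey's rule, which flags values lying more than 1.5 times the interquartile range beyond the quartiles. This rule is an assumption made for illustration; the project does not state which criterion, if any, was applied. The function name `iqr_outliers` is hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]
```

Note that `statistics.quantiles` defaults to its "exclusive" method, so the quartiles may differ slightly from those reported by Pandas. For a dataset of this size the difference is unlikely to change which points are flagged, but that has not been verified here.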


### 7. Visualizations
Seaborn and Matplotlib were used to generate plots. The visualizations created include:
- Histogram of each feature (sepal/petal length and width)
- Pairwise scatter plots (e.g., sepal length vs sepal width)

All visuals were saved into a 'plots' directory and named consistently. The plots followed design best practices: readable labels, color coding by species, and appropriate use of figure sizes.

### 8. Code Quality and Structure
The code was structured into reusable, logical blocks. For example, functions were used to generate plots in loops. All outputs were dynamically saved using formatted strings.

Code comments and modular structure ensure that anyone reading or using the code can follow and replicate the analysis with ease. The script follows Pythonic conventions and has been tested to run without errors.
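
One way to keep output naming consistent is to build every filename in a single helper. The sketch below is a possible refactoring, not the project's actual code: the names `plot_path` and `planned_outputs` are hypothetical, while the `histogram_<feature>.png` and `scatter_<x>_vs_<y>.png` patterns mirror the formatted strings used in the script.

```python
import itertools
from pathlib import Path


def plot_path(kind, *features, out_dir="plots", ext="png"):
    """Build a path such as plots/scatter_a_vs_b.png from a plot kind and feature names."""
    return Path(out_dir) / f"{kind}_{'_vs_'.join(features)}.{ext}"


def planned_outputs(features, out_dir="plots"):
    """List every histogram and pairwise scatter-plot path for the given features."""
    histograms = [plot_path("histogram", f, out_dir=out_dir) for f in features]
    scatters = [
        plot_path("scatter", x, y, out_dir=out_dir)
        for x, y in itertools.combinations(features, 2)
    ]
    return histograms + scatters
```

For the four Iris measurements this yields ten paths: four histograms and six scatter plots, one for each unordered pair of features. Keeping the naming logic in one place means a change of directory or file extension only has to be made once.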

### 9. Learning Reflections
This project significantly contributed to the development of scripting, version control, and analytical skills. Understanding how to structure a project, keep it organized, and present it to others was one of the most valuable takeaways.

This experience highlighted the importance of research before implementation, testing throughout development, and writing user-friendly documentation. From using Git efficiently to visualizing real-world data, the skills learned here are applicable far beyond this assignment. 


### 10. Conclusion
The project met all objectives defined in the brief. From loading the data and conducting EDA to creating visual outputs and documenting the process, each part has been completed to a high standard, with a special focus on research, coding, and communication. This report, along with the GitHub repository, serves as evidence of a structured and thorough approach to problem-solving using programming and scripting.

### 11. References
- Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/iris
- Python: https://www.python.org/
- Pandas: https://pandas.pydata.org/
- Seaborn: https://seaborn.pydata.org/
- Matplotlib: https://matplotlib.org/
- GitHub Documentation: https://docs.github.com/

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
import os

# Load dataset
iris = sns.load_dataset('iris')

# Save to CSV (if needed)
iris.to_csv('iris.csv', index=False)

# Create output directories
os.makedirs('plots', exist_ok=True)

# Summary stats
summary = iris.describe()
summary.to_csv('variable_summary.txt', sep='\t')

# Histograms
for column in iris.columns[:-1]:
    plt.figure()
    sns.histplot(iris[column], kde=True, bins=20)
    plt.title(f'Histogram of {column}')
    plt.savefig(f'plots/histogram_{column}.png')
    plt.close()

# Scatter plots for all pairs
for x, y in itertools.combinations(iris.columns[:-1], 2):
    plt.figure()
    sns.scatterplot(data=iris, x=x, y=y, hue='species')
    plt.title(f'Scatter Plot: {x} vs {y}')
    plt.savefig(f'plots/scatter_{x}_vs_{y}.png')
    plt.close()

print("Analysis complete. Outputs saved.")
```

```
Analysis complete. Outputs saved.
```
