This project is a Python-based data analysis and visualization assignment that demonstrates fundamental skills in using the pandas and matplotlib libraries. The goal is to load, clean, analyze, and visualize a dataset to uncover insights and patterns.
The project fulfills the following objectives:
- Loading and exploring a dataset using pandas.
- Performing basic statistical analysis and data cleaning.
- Creating a variety of plots and charts with matplotlib to visualize key findings.
This project uses a sample dataset that mimics the Iris dataset, a classic dataset for classification problems. The dataset contains information about various measurements of different iris flower species.
- data_analysis.py: The main Python script containing all the code for data loading, analysis, and visualization.
- README.md: This file, providing an overview and instructions for the project.
- Clone the Repository:

      git clone https://github.com/your-username/Data-Analysis-Project.git
      cd Data-Analysis-Project

- Install Required Libraries: Make sure you have pandas and matplotlib installed. If not, you can install them using pip (the command below also installs numpy):

      pip install pandas matplotlib numpy

- Execute the Script: Run the Python script from your terminal. This will perform the analysis and display the generated plots.

      python data_analysis.py
The script performs the following key analyses:
- Data Exploration: Checks for missing values and data types. Missing numerical values are filled with the mean of their respective columns.
- Descriptive Statistics: Computes the mean, median, standard deviation, and other statistics for numerical columns to summarize the data.
- Grouped Analysis: Calculates the average sepal_length and petal_length for each species, revealing significant differences between the groups.
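
The three analyses above can be sketched roughly as follows. This is a minimal illustration, not the code in data_analysis.py. The small inline DataFrame, its values, and the variable names are assumptions made for the example. Only the column names come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with Iris-style columns.
# The values are made up for illustration; one sepal_length is missing on purpose.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Data exploration: inspect column types and count missing values per column.
print(df.dtypes)
missing_before = df.isnull().sum()
print(missing_before)

# Fill missing numerical values with the mean of their respective column.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Descriptive statistics: describe() reports count, mean, std, min,
# quartiles (the 50% row is the median), and max for numerical columns.
summary = df.describe()
print(summary)

# Grouped analysis: average sepal_length and petal_length per species.
species_means = df.groupby("species")[["sepal_length", "petal_length"]].mean()
print(species_means)
```

Filling with the column mean keeps every row in the dataset, at the cost of slightly shrinking the variance of the filled column. Whether that trade-off is acceptable depends on how many values are missing.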
The script generates four plots to visualize the data:
- Scatter Plot: Shows the relationship between sepal_length and petal_length, which clearly separates the species.
- Bar Chart: Compares the average petal_length across different species.
- Histogram: Illustrates the distribution of sepal_width values.
- Line Chart: Shows a trend of sepal_length over a dummy time-series, demonstrating how to plot data over time.
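
One way the four plots listed above could be produced with matplotlib is sketched here. It is an illustrative example under stated assumptions rather than the script's actual code: the inline data, the 2x2 subplot layout, the daily date range used as the dummy time-series, and the output filename iris_plots.png are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature dataset with Iris-style columns (values made up).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Scatter plot: sepal_length vs. petal_length, one series per species.
ax = axes[0, 0]
for species, group in df.groupby("species"):
    ax.scatter(group["sepal_length"], group["petal_length"], label=species)
ax.set_xlabel("sepal_length")
ax.set_ylabel("petal_length")
ax.set_title("Sepal length vs. petal length")
ax.legend()

# Bar chart: average petal_length per species.
ax = axes[0, 1]
mean_petal = df.groupby("species")["petal_length"].mean()
ax.bar(mean_petal.index, mean_petal.values)
ax.set_ylabel("mean petal_length")
ax.set_title("Average petal length by species")

# Histogram: distribution of sepal_width.
ax = axes[1, 0]
ax.hist(df["sepal_width"], bins=5)
ax.set_xlabel("sepal_width")
ax.set_title("Distribution of sepal width")

# Line chart: sepal_length over a dummy daily date range (assumed dates).
ax = axes[1, 1]
dates = pd.date_range("2024-01-01", periods=len(df), freq="D")
ax.plot(dates, df["sepal_length"], marker="o")
ax.tick_params(axis="x", labelrotation=45)
ax.set_title("Sepal length over a dummy time-series")

fig.tight_layout()
fig.savefig("iris_plots.png")  # hypothetical output file
```

To view the figure interactively instead of saving it, remove the matplotlib.use("Agg") line and call plt.show() at the end.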
Feel free to explore the data_analysis.py file to see the code in detail. Contributions and feedback are welcome!