This Python script showcases an analysis of insurance-related data and demonstrates linear regression modeling. It employs various libraries, including numpy, pandas, matplotlib, seaborn, and scikit-learn, to perform data exploration and modeling. The primary steps are as follows:
Table of Contents
- Introduction
- Installation
- Usage
- Data Source
- Data Visualization
- Linear Regression Model
- Visualization of Regression Results
- Screenshots
The main objective of this project is to explore and analyze a dataset of insurance information and to run a simple linear regression that predicts BMI from age. It covers the following tasks:
● Loading data from a CSV file.
● Data cleaning and preprocessing.
● Creating various plots using Matplotlib and Seaborn for data visualization.
● Implementing a linear regression model to predict BMI from age.
- Clone this repository: git clone https://github.com/Keshajani12/Insurance-Data-Analysis-Using-Python.git
- Navigate to the project directory: cd Insurance
- Install the required Python packages using pip: pip install pandas numpy matplotlib seaborn scikit-learn
  Alternatively, download the repository as a ZIP archive and install the dependencies listed in requirements.txt: pip install -r requirements.txt
- Run the Python script: python insurance.py
- The script loads the insurance data, performs the analysis, and generates and displays the plots.
The script starts by loading an insurance dataset from a CSV file named 'insurance.csv'. The dataset contains fields such as 'age', 'sex', 'bmi', 'region', and 'charges' for each individual.
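The loading step described above might look roughly like the following sketch. The sample rows are made up for illustration and stand in for 'insurance.csv'; the helper name `load_insurance` and the specific cleaning calls (dropping duplicates and missing values) are assumptions, not necessarily what insurance.py does.

```python
import io

import pandas as pd

# Hypothetical sample rows, invented for illustration only.
# The real script reads the file 'insurance.csv' instead.
SAMPLE_CSV = """age,sex,bmi,smoker,region,charges
23,female,24.5,no,north,2100.50
61,male,31.2,yes,south,28950.75
45,female,27.8,no,east,7800.00
57,male,29.1,no,west,11500.20
34,male,22.3,yes,north,18200.10
58,female,33.4,no,south,12750.60
"""


def load_insurance(source):
    """Load insurance data from a path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    # Assumed basic cleaning: remove exact duplicates and rows with missing values.
    df = df.drop_duplicates().dropna().reset_index(drop=True)
    return df


# With the real data this would be: df = load_insurance("insurance.csv")
df = load_insurance(io.StringIO(SAMPLE_CSV))

print(df.head())      # first few rows
print(df.shape)       # (rows, columns)
print(df.describe())  # statistical summary of the numeric columns
```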
Data Exploration
Basic information about the dataset is displayed, including the first few rows, shape, and statistical summary. Data is divided into two age groups: 'oldAge' (age >= 55) and 'youngAge' (age < 55). Within these groups, data is further segmented by gender into 'oldAgeMale', 'oldAgeFemale', 'youngAgeMale', and 'youngAgeFemale'. The size and a sample of each subgroup are also presented.
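The grouping described above can be expressed with boolean masks in pandas. This is a minimal sketch on a small made-up DataFrame; the variable names follow the README, while the 'male' and 'female' labels in the 'sex' column are an assumption about how the data is encoded.

```python
import pandas as pd

# Tiny made-up frame for illustration; the real script uses insurance.csv.
df = pd.DataFrame({
    "age": [19, 62, 58, 33, 55, 41],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "bmi": [24.1, 30.2, 28.7, 26.5, 31.0, 22.9],
})

# Split on the age threshold given in the README.
oldAge = df[df["age"] >= 55]
youngAge = df[df["age"] < 55]

# Segment each age group by gender (labels assumed to be 'male'/'female').
oldAgeMale = oldAge[oldAge["sex"] == "male"]
oldAgeFemale = oldAge[oldAge["sex"] == "female"]
youngAgeMale = youngAge[youngAge["sex"] == "male"]
youngAgeFemale = youngAge[youngAge["sex"] == "female"]

groups = {
    "oldAgeMale": oldAgeMale,
    "oldAgeFemale": oldAgeFemale,
    "youngAgeMale": youngAgeMale,
    "youngAgeFemale": youngAgeFemale,
}
for name, group in groups.items():
    # Print the size and a short sample of each subgroup.
    print(name, group.shape)
    print(group.head(2))
```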
Several data visualization techniques are employed to gain insights:
● Barplot: A barplot is created using Seaborn to compare 'bmi' among 'oldAge' individuals, distinguished by gender and region.
● Bar Chart: A bar chart is generated to visualize the frequency of smokers within different age groups.
● Lineplot: Seaborn is used to produce a lineplot that illustrates the relationship between 'age' and 'charges', with hue differentiation by gender.
● Violinplot: A violinplot is created to display the distribution of 'charges' across different regions.
● Countplot: Seaborn's countplot is used to visualize the count of individuals within specific 'age' groups, separated by gender.
● Histplot: A histogram is generated to visualize the distribution of 'bmi' values with a specified edge color and fill color.
The script proceeds to perform a linear regression analysis to predict 'bmi' based on 'age':
● Data is split into training and testing sets using the train_test_split function.
● A Linear Regression model is trained using the training data.
● Predictions are made on the test data using the trained model.
● The Mean Squared Error (MSE) is computed to evaluate the model's performance.
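The four steps above map onto scikit-learn roughly as in this sketch. The data is synthetic, and the split ratio (`test_size=0.2`) and `random_state` values are assumptions chosen for illustration; the README does not state which values insurance.py uses.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic age/bmi data for illustration only (not the insurance dataset).
rng = np.random.default_rng(42)
age = rng.integers(18, 65, size=200)
bmi = 25 + 0.05 * age + rng.normal(0, 4, size=200)
df = pd.DataFrame({"age": age, "bmi": bmi})

X = df[["age"]]  # 2-D feature matrix, as scikit-learn expects
y = df["bmi"]    # target to predict

# Split into training and testing sets (ratio and seed are assumed values).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train the linear regression model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict BMI for the held-out test data.
y_pred = model.predict(X_test)

# Mean squared error: the average of (actual - predicted) squared.
mse = mean_squared_error(y_test, y_pred)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("MSE:", mse)
```

A lower MSE means the predictions sit closer to the actual values on average; since it is in squared BMI units, taking its square root gives an error in the same units as 'bmi'.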
● A DataFrame named 'record' is created to hold both the actual 'BMI' values and the predicted 'BMI' values from the regression model.
● A line plot is generated using Seaborn to visually compare the actual and predicted 'BMI' values against 'age'. The plot shows how closely the regression model approximates 'BMI' based on 'age'.
This script serves as a worked example of data analysis and linear regression modeling on the insurance dataset. It can be adapted for similar regression tasks and used as a learning resource by data analysis enthusiasts and aspiring data scientists.