Predictive Analysis

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind

# 1. Data Retrieval
df = pd.read_csv("predictive_analysis.csv", encoding='latin1')

# 2. Data Cleaning
df = df.drop_duplicates()
df = df.dropna()
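# Note: 'Date' holds integer serial numbers (45383, ...), so this call reads them
# as nanoseconds since 1970-01-01 (see the Date column in the output below).
df['Date'] = pd.to_datetime(df['Date'])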

# 3. Exploratory Data Analysis (EDA)
print("First few rows of the DataFrame:")
print(df.head())

print("\nSummary statistics:")
print(df.describe())

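# Note: this call raises the ValueError shown in the output below because the
# DataFrame still contains string columns; the next cell keeps only numeric columns.
correlation_matrix = df.corr()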
print("\nCorrelation matrix:")
print(correlation_matrix)

df.hist(figsize=(10, 10))
plt.show()

sns.countplot(x='Location', data=df)
plt.xticks(rotation=90)
plt.show()

# 4. Data Visualization
sns.boxplot(x='Location', y='Energy Consumed (kWh)', data=df)
plt.xticks(rotation=90)
plt.show()

sns.scatterplot(x='Temperature (°C)', y='Energy Consumed (kWh)', data=df)
plt.show()

# 5. Statistical Analysis
group1 = df[df['Weather Condition'] == 'Sunny']['Energy Consumed (kWh)']
group2 = df[df['Weather Condition'] == 'Rainy']['Energy Consumed (kWh)']
t_stat, p_value = ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)

# 6. Data Presentation
df.to_csv("cleaned_and_analyzed_data.csv", index=False)


First few rows of the DataFrame:
  Charging Station ID               Location                          Date  \
0               CS001        Connaught Place 1970-01-01 00:00:00.000045383   
1               CS002        South Extension 1970-01-01 00:00:00.000045384   
2               CS003              Cyber Hub 1970-01-01 00:00:00.000045385   
3               CS004  Indira Gandhi Airport 1970-01-01 00:00:00.000045386   
4               CS005         DLF Cyber City 1970-01-01 00:00:00.000045387   

   Uptime (%)  Downtime (hours)  Usage Patterns (%)  Energy Consumed (kWh)  \
0        97.8               1.2                78.5                  234.6   
1        99.2               0.8                82.3                  210.8   
2        95.6               2.4                75.9                  198.3   
3        98.5               1.5                80.2                  185.7   
4        96.3               3.7                72.8                  221.5   

   Total Sessions  Average Se

ValueError: could not convert string to float: 'CS001'
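
The `Date` values printed above (`1970-01-01 00:00:00.000045383`) show that the integers in the `Date` column were interpreted as nanoseconds since the Unix epoch. Below is a minimal sketch of one possible fix. It assumes the column holds Excel-style serial day numbers, which a value such as 45383 suggests but the source does not confirm. The sample values are hypothetical stand-ins for the column.

```python
import pandas as pd

# Hypothetical sample mirroring the first Date values in the output above.
serials = pd.Series([45383, 45384, 45385], name="Date")

# Assumption: the integers are Excel-style serial days counted from 1899-12-30.
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")
print(dates)
```

Under that assumption, 45383 maps to 2024-04-01. If the file uses a different date encoding, the `unit` and `origin` arguments would need to change accordingly.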

In [None]:
import pandas as pd

# Read the CSV file into a DataFrame with specified encoding
df = pd.read_csv("predictive_analysis.csv", encoding='latin1')

# Select only numeric columns
numeric_df = df.select_dtypes(include=['float64', 'int64'])

# Calculate summary statistics
print("Summary Statistics:")
print(numeric_df.describe())

# Calculate correlation matrix
correlation_matrix = numeric_df.corr()
print("\nCorrelation matrix:")
print(correlation_matrix)


Summary Statistics:
               Date  Uptime (%)  Downtime (hours)  Usage Patterns (%)  \
count     35.000000   35.000000         35.000000           35.000000   
mean   45400.000000   97.820000          2.094286           79.134286   
std       10.246951    1.155753          1.101321            4.281557   
min    45383.000000   95.600000          0.500000           70.500000   
25%    45391.500000   96.950000          1.150000           75.900000   
50%    45400.000000   97.900000          2.000000           79.400000   
75%    45408.500000   98.850000          2.950000           82.500000   
max    45417.000000   99.500000          4.400000           86.300000   

       Energy Consumed (kWh)  Total Sessions  \
count              35.000000       35.000000   
mean              216.328571       49.228571   
std                14.775329        2.921379   
min               185.700000       44.000000   
25%               204.150000       47.000000   
50%               217.700000      

The summary statistics and correlation matrix above can be read as follows:

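1. **Summary Statistics:**
   - **Count:** The number of non-null values in each column. Every column shown in the output has a count of 35, so none of them contains missing values.
   - **Mean:** The average of each numeric column, a measure of central tendency. For example, mean uptime is 97.82% and mean energy consumed is about 216.3 kWh.
   - **Standard Deviation:** The spread of the data points around the mean; a higher value indicates greater variability. Uptime varies little (about 1.16 percentage points), while energy consumed has a standard deviation of about 14.8 kWh.
   - **Minimum and Maximum:** The range of values observed in each column. Uptime ranges from 95.6% to 99.5%, and downtime from 0.5 to 4.4 hours.
   - **25th, 50th (median), and 75th percentiles:** These describe the shape of the distribution and help flag potential outliers. Median uptime (97.9%) is close to the mean (97.82%), which suggests a roughly symmetric distribution.
   - **Date column:** `Date` appears in the table only because it was read as an integer serial number (45383 to 45417). Its mean and standard deviation are not meaningful measurements.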
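
The percentile bullet above mentions outliers. One common way to make that concrete is the 1.5 × IQR rule, sketched below. The function name `iqr_outliers`, the multiplier `k`, and the sample values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical readings, not taken from the dataset.
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
flagged = iqr_outliers(sample)
print(flagged)
```

Applied by hand to the `Downtime (hours)` quartiles printed above (1.15 and 2.95), the fences are -1.55 and 5.65. The observed minimum (0.5) and maximum (4.4) both fall inside them, so this rule would flag no downtime values.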

2. **Correlation Matrix:**
   - The correlation matrix shows the Pearson correlation coefficient for each pair of numeric variables in the dataset.
   - A coefficient close to 1 indicates a strong positive linear relationship, close to -1 a strong negative linear relationship, and close to 0 little or no linear relationship (a nonlinear relationship may still exist).
   - Positive correlations suggest that as one variable increases, the other tends to increase as well. Negative correlations suggest that as one variable increases, the other tends to decrease.
   - We can use the correlation matrix to identify relationships between variables. For example, if we are predicting energy consumption, a strong positive correlation between temperature and energy consumed would mean that hotter days tend to coincide with higher energy usage. Correlation alone does not show that temperature causes the increase.
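
To make the matrix easier to scan, the variable pairs can be ranked by the absolute value of their coefficient. The following is a minimal sketch under stated assumptions: the helper `top_correlations` and the toy DataFrame are hypothetical, and the column names are placeholders rather than columns from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Rank pairs of numeric columns by absolute Pearson correlation."""
    # Drop string columns first so corr() does not fail on them.
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy data with one non-numeric column.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [1, 3, 2, 4],
})
ranked = top_correlations(toy)
print(ranked)
```

Ranking by absolute value keeps strong negative relationships near the top instead of pushing them to the bottom of the list.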

Together, these statistics show how each variable is distributed and which pairs of variables move together, which helps decide which relationships to investigate further.