# 📊 Data Analysis Assignment

## Objective
- Load and analyze a dataset using Pandas.
- Perform basic statistics and grouping.
- Visualize insights with Matplotlib and Seaborn.

Dataset used: **CRMreport.csv** (customer subscription, billing, and package details).


## Task 1: Load & Explore the Dataset


```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Load dataset

df = pd.read_csv("../data/CRMreport.csv")
df.head()
```

### Dataset info
```python
df.info()
```

### Check missing values
```python
df.isnull().sum()
```
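
Raw counts are hard to judge without the row total, so it can help to express missing values as a share of each column. The sketch below uses a made-up toy frame (the `toy` name and its values are assumptions, not rows from `CRMreport.csv`); the same expression should work on `df`.

```python
import pandas as pd

# Toy frame standing in for the CRM data; the values are invented for illustration.
toy = pd.DataFrame({
    "pkg": ["Basic", "Plus", None, "Basic"],
    "days": [30.0, None, None, 90.0],
})

# isnull() yields booleans, so mean() is the fraction of missing cells per column.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

Columns with a high missing share are candidates for dropping or imputing before any grouping or plotting.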

### Drop duplicates and clean column names
```python
df = df.drop_duplicates()
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df.head()
```
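
To see what the column-cleaning chain does, here is a minimal sketch on hypothetical raw headers (the header names are assumptions chosen for illustration, not the actual headers of `CRMreport.csv`).

```python
import pandas as pd

# Hypothetical raw headers with stray whitespace and mixed case.
raw = pd.DataFrame(columns=[" PKG", "Days ", "Franchise Name"])

# strip() removes outer whitespace, lower() normalizes case,
# and replace() turns inner spaces into underscores.
cleaned = raw.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(cleaned))
```

Normalizing the names once up front is what lets the later cells test for lowercase columns such as `"pkg"` and `"days"`.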

## Task 2: Basic Data Analysis


### Summary statistics
```python
df.describe(include="all")
```

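### Grouping example: average DAYS per package
```python
if "pkg" in df.columns and "days" in df.columns:
    avg_days_pkg = df.groupby("pkg")["days"].mean().sort_values(ascending=False)
    # A bare expression inside an `if` block is not echoed by the notebook, so print it.
    print(avg_days_pkg)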
```
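
An average alone can mislead when a package has only a handful of customers. One possible extension, sketched here on invented toy data (the `toy` frame and its values are assumptions), reports the group size next to the mean.

```python
import pandas as pd

# Toy data: package names and day counts are made up for illustration.
toy = pd.DataFrame({
    "pkg": ["Basic", "Basic", "Plus", "Plus", "Pro"],
    "days": [10, 30, 60, 80, 45],
})

# agg() computes several statistics per group in one pass.
summary = (
    toy.groupby("pkg")["days"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

Here `Pro` has a single row, so its mean deserves less weight than the means for `Basic` and `Plus`.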

## Task 3: Data Visualization
We will create four plots:

1. Line chart (installs over time)
2. Bar chart (average days per package)
3. Histogram (distribution of days)
4. Scatter plot (days vs band)

```python
# set_theme is the preferred name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid")
os.makedirs("../outputs", exist_ok=True)
```

### 1. Line chart: Installs over time
```python
if "installdate" in df.columns:
    df["installdate"] = pd.to_datetime(df["installdate"], errors="coerce")
    installs_over_time = df.groupby(df["installdate"].dt.to_period("M")).size()
    installs_over_time.plot(kind="line", marker="o")
    plt.title("Installations Over Time")
    plt.xlabel("Month")
    plt.ylabel("Number of Installs")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig("../outputs/line.png")
    plt.show()
```
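
The monthly grouping used for the line chart can be checked on a small example. This is a minimal sketch with invented dates (the `toy` frame is an assumption); it also shows that values `errors="coerce"` cannot parse become `NaT` and drop out of the counts.

```python
import pandas as pd

# Toy install dates, including one unparseable entry.
toy = pd.DataFrame(
    {"installdate": ["2024-01-05", "2024-01-20", "2024-02-11", "not a date"]}
)

# Unparseable strings become NaT instead of raising an error.
toy["installdate"] = pd.to_datetime(toy["installdate"], errors="coerce")

# to_period("M") buckets each date into its calendar month;
# groupby skips NaT keys by default, so the bad row is not counted.
monthly = toy.groupby(toy["installdate"].dt.to_period("M")).size()
print(monthly)
```

If many rows turn into `NaT`, the chart will undercount installs, so it is worth comparing `toy["installdate"].isna().sum()` against the row total before reading trends off the plot.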

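### 2. Bar chart: Average DAYS per package
```python
if "pkg" in df.columns and "days" in df.columns:
    # errorbar=None replaces the deprecated ci=None (requires seaborn >= 0.12)
    sns.barplot(
        x="pkg", y="days", data=df, estimator="mean", errorbar=None,
        order=df["pkg"].value_counts().index,  # most common packages first
    )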
    plt.title("Average DAYS per Package")
    plt.xlabel("Package")
    plt.ylabel("Average DAYS")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.savefig("../outputs/bar.png")
    plt.show()
```

### 3. Histogram: Distribution of DAYS
```python
if "days" in df.columns:
    plt.hist(df["days"].dropna(), bins=20, edgecolor="black")
    plt.title("Distribution of DAYS")
    plt.xlabel("Days")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.savefig("../outputs/histogram.png")
    plt.show()
```

### 4. Scatter plot: Days vs Band
```python
if "days" in df.columns and "band" in df.columns:
    sns.scatterplot(x="days", y="band", data=df)
    plt.title("Days vs Band")
    plt.xlabel("Days")
    plt.ylabel("Band")
    plt.tight_layout()
    plt.savefig("../outputs/scatter.png")
    plt.show()
```

## Findings / Observations

- The dataset shows how customers are distributed across packages, franchises, and teams.
- Average `days` differs by package, so some packages keep customers longer than others.
- The monthly install counts in the line chart show whether installations are growing, declining, or seasonal.
- The scatter plot of `days` against `band` helps identify usage patterns and anomalies.
