{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sales Performance Analysis\n",
    "\n",
    "This notebook applies basic statistical analysis to a sales dataset to showcase data analysis skills for a Data Analyst position. It covers descriptive statistics, visualizations, correlation analysis, grouped statistics, and hypothesis testing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the Dataset\n",
    "\n",
    "Load the `sales_data_sample.csv` file downloaded from Kaggle."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the CSV with an explicit latin1 encoding instead of the UTF-8 default\n",
    "df = pd.read_csv('sales_data_sample.csv', encoding='latin1')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling Missing Values\n",
    "\n",
    "Fill missing values in non-critical columns to prepare for analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace missing values with placeholders and assign each column back to df\n",
    "df['ADDRESSLINE2'] = df['ADDRESSLINE2'].fillna('')\n",
    "df['STATE'] = df['STATE'].fillna('Unknown')\n",
    "df['POSTALCODE'] = df['POSTALCODE'].fillna('Unknown')\n",
    "df['TERRITORY'] = df['TERRITORY'].fillna('Unknown')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Descriptive Statistics\n",
    "\n",
    "Compute the mean, median, and standard deviation of the key numerical columns `SALES` and `QUANTITYORDERED`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Mean Sales:', df['SALES'].mean())\n",
    "print('Median Sales:', df['SALES'].median())\n",
    "print('Standard Deviation of Sales:', df['SALES'].std())\n",
    "\n",
    "print('\\nMean Quantity Ordered:', df['QUANTITYORDERED'].mean())\n",
    "print('Median Quantity Ordered:', df['QUANTITYORDERED'].median())\n",
    "print('Standard Deviation of Quantity Ordered:', df['QUANTITYORDERED'].std())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization - Sales Distribution\n",
    "\n",
    "Plot a histogram to visualize the distribution of sales values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(df['SALES'], kde=True)\n",
    "plt.title('Distribution of Sales')\n",
    "plt.xlabel('Sales')\n",
    "plt.ylabel('Frequency')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlation Analysis\n",
    "\n",
    "Create a heatmap to show correlations between numerical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numeric_df = df.select_dtypes(include=[np.number])\n",
    "corr = numeric_df.corr()\n",
    "plt.figure(figsize=(12, 8))\n",
    "sns.heatmap(corr, annot=True, cmap='coolwarm')\n",
    "plt.title('Correlation Matrix')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grouped Statistics\n",
    "\n",
    "Group by PRODUCTLINE and compute mean, median, and standard deviation of SALES."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = df.groupby('PRODUCTLINE')['SALES'].agg(['mean', 'median', 'std'])\n",
    "print(grouped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hypothesis Testing\n",
    "\n",
    "Perform an independent two-sample t-test to check whether mean `SALES` differs significantly between the USA and all other countries, using a 0.05 significance level."
   ]
  },
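  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split SALES into USA rows and rows from every other country\n",
    "us_sales = df[df['COUNTRY'] == 'USA']['SALES']\n",
    "other_sales = df[df['COUNTRY'] != 'USA']['SALES']\n",
    "\n",
    "# Independent two-sample t-test; by default ttest_ind assumes equal variances\n",
    "t_stat, p_val = stats.ttest_ind(us_sales, other_sales)\n",
    "print('T-statistic:', t_stat)\n",
    "print('P-value:', p_val)\n",
    "\n",
    "# Compare the p-value against the 0.05 significance level\n",
    "if p_val < 0.05:\n",
    "    print('There is a significant difference in sales between USA and other countries.')\n",
    "else:\n",
    "    print('No significant difference in sales between USA and other countries.')"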
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}