# Analysis OCM --> Ajah

Context: GivingTuesday frequently looks at government and non-profit data to uncover insights and
drive decision-making. We would like you to take a look at some data and deliver a report describing
your initial data exploration and ideas for future work. This report should:
● Summarize the data,
● Describe some interesting patterns and/or trends in the data,
● Outline what a future research/business project might look like and why you think it's important.

Expectation: Spend 2-3 hours ingesting and analyzing the data. Use Python or SQL in the analysis.

Objective: Test technical skills, but also (more importantly) research/business problem-solving.

Dataset:
● IRS Tax Extract and EOBMF
● US nonprofits are tax-exempt organizations that must publicly report their taxes. The datasets
above summarize tax returns (tax extract) and organizations (EOBMF).

Description:
● Your job is to choose one or both files, and:
○ Quickly characterize the “grain” (what each row is unique to)
○ Quickly characterize the broad categories of data available (not by data type, but at the
semantic / analysis level)
○ Define an interesting business or research problem to solve
○ Quickly characterize the dataset (via summary stats or graphs), to generally characterize
the data you’ll be using for analysis.
○ Provide tabular and graphical breakdowns (at least one of each) of the data, along with
written commentary, to begin addressing that problem.
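
One way to check the "grain" asked for above is to test whether a candidate set of key columns identifies each row uniquely. The sketch below is a minimal illustration under stated assumptions: `check_grain` is a hypothetical helper, and the column names (`EIN`, `tax_period`, `total_revenue`) are placeholders on toy data, not confirmed field names from either file.

```python
import pandas as pd


def check_grain(df: pd.DataFrame, keys: list[str]) -> dict:
    """Report whether `keys` uniquely identify each row of `df`."""
    # keep=False flags every row that shares its key with another row
    duplicated = df.duplicated(subset=keys, keep=False)
    return {
        "rows": len(df),
        "distinct_keys": len(df.drop_duplicates(subset=keys)),
        "duplicated_rows": int(duplicated.sum()),
        "is_unique": not duplicated.any(),
    }


# Toy data standing in for a tax extract; column names are hypothetical.
toy = pd.DataFrame(
    {
        "EIN": ["010000001", "010000001", "020000002"],
        "tax_period": [202112, 202212, 202212],
        "total_revenue": [100, 120, 50],
    }
)

by_ein = check_grain(toy, ["EIN"])
by_ein_period = check_grain(toy, ["EIN", "tax_period"])
print(by_ein)
print(by_ein_period)
```

If a single identifier is not unique, adding a period column and re-running the check is a quick way to see whether the grain is "one row per organization" or "one row per organization per filing period".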

Expected outputs:
● Notebook containing code (Python or SQL), as well as tabular, graphical, and text outputs.
● Text outputs should:
○ Describe the dataset used (which file(s), grain & categories of data)
○ Describe the problem and why you think it’s important
○ Describe each analysis output - what it is and how it informs the business problem

Criteria we’ll be reviewing:
● Overall coding and notebook approach
● Justification of what problem you choose
● Analytic approach to solving the problem
● Ability to communicate, in words, your findings
● Overall sense of quality, prioritization, approach, and technical skill

Desirable skills:
● Translating business problems into analyses
● Writing up and communicating results to technical and non-technical audiences
● Building notebooks or dashboards to make analysis repeatable (as needed)

## Libraries & WD

### Libraries

In [1]:
# Data manipulation
# ==============================================================================
import os
import operator

import numpy as np
import openpyxl  # Excel engine used by pd.read_excel for .xlsx files
import pandas as pd
from joblib import dump

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
plt.rcParams['font.size'] = 10

# Modeling
# ==============================================================================
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    accuracy_score, auc, confusion_matrix, f1_score, mean_squared_error,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.model_selection import learning_curve, train_test_split, validation_curve

# Warnings configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')

### Set Working directory

In [2]:
# Set the working directory

workspace_directory = "/Users/oskyroski/DataScience/Jobs/Ajah"
os.chdir(workspace_directory)
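
A hard-coded absolute path only works on one machine. As a possible alternative, the sketch below builds file paths with `pathlib` instead of changing the working directory. `AJAH_DATA_DIR` is a hypothetical environment variable and `data_path` a hypothetical helper; neither is part of the original notebook.

```python
import os
from pathlib import Path

# AJAH_DATA_DIR is a hypothetical environment variable (an assumption).
# When it is unset, fall back to the directory the notebook was launched from.
data_dir = Path(os.environ.get("AJAH_DATA_DIR", Path.cwd()))


def data_path(filename: str, base: Path = data_dir) -> Path:
    """Build the full path to a data file without changing the working directory."""
    return base / filename


excel_path = data_path("22eoextract990.xlsx")
csv_path = data_path("sit-2022.csv")
print(excel_path.name, excel_path.exists())
```

The resulting `Path` objects can be passed directly to `pd.read_excel` and `pd.read_csv`, so the notebook runs unchanged wherever the two files are placed.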

### Ingest data files

In [3]:
# Read the Excel file (.xlsx)
df_excel = pd.read_excel('22eoextract990.xlsx')

# Read the CSV file
df_csv = pd.read_csv('sit-2022.csv')

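
Identifier columns are easy to mangle on import: parsed as integers they lose leading zeros, and large files read in chunks can end up with mixed inferred types in one column. The sketch below shows one way to guard against both, using a small in-memory CSV in place of the real file. The column names (`EIN`, `NAME`, `STATE`) are assumptions for illustration, not confirmed headers of `sit-2022.csv`.

```python
import io

import pandas as pd

# Small in-memory stand-in for the real CSV; the column names are assumptions.
sample_csv = io.StringIO(
    "EIN,NAME,STATE\n"
    "010000001,EXAMPLE ORG A,ME\n"
    "020000002,EXAMPLE ORG B,NH\n"
)

# dtype={"EIN": str} keeps the identifier as text, preserving leading zeros.
# low_memory=False makes pandas infer each column's type from the whole file
# rather than chunk by chunk.
df_sample = pd.read_csv(sample_csv, dtype={"EIN": str}, low_memory=False)
print(df_sample.dtypes)
print(df_sample["EIN"].tolist())
```

The same `dtype` and `low_memory` arguments can be passed when reading the real file, once its actual identifier column names have been checked.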


## Session information

In [4]:
import session_info
session_info.show(html=False)

-----
joblib              1.3.2
matplotlib          3.8.3
numpy               1.26.1
openpyxl            3.1.2
pandas              2.1.1
seaborn             0.13.2
session_info        1.0.0
sklearn             1.4.1.post1
-----
IPython             8.16.1
jupyter_client      8.5.0
jupyter_core        5.4.0
-----
Python 3.11.6 (v3.11.6:8b6ee5ba3b, Oct  2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)]
macOS-13.5.2-x86_64-i386-64bit
-----
Session information updated at 2024-04-01 10:58
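
`session_info` is a third-party package. Where it is not installed, a similar version listing can be approximated with the standard library alone. This is a minimal sketch, not a replacement for the output above; `package_versions` is a hypothetical helper, and the package list is chosen by hand.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def package_versions(names: list[str]) -> dict[str, str]:
    """Map each distribution name to its installed version, or 'not installed'."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Distribution names as published on PyPI (scikit-learn, not sklearn).
versions = package_versions(["pandas", "numpy", "scikit-learn", "matplotlib", "seaborn"])
print("Python", platform.python_version())
for name, ver in versions.items():
    print(f"{name:15s} {ver}")
```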
