# Identifying Companies with Anomalies

This notebook identifies the years in which companies were reported to have anomalies in their financial statements.

## Data Source 

**EDGAR Database:** In the initial phase, we checked the EDGAR database to confirm that financial reports were available for the years in which the listed companies were reported to have financial anomalies.

## Steps in Analysis

1. Identifying Anomalies

We reviewed past literature and regulatory findings to compile a list of companies with notable discrepancies in their financial reports, together with the specific years in which the anomalies were reported. The output is the **companies_with_anomalies** list, which includes each company's *cik_str*.

2. Data Retrieval from EDGAR

We matched the companies in our list with their corresponding CIK numbers to fetch detailed reports from EDGAR. We used the *cik_str* from our list to query the EDGAR database and retrieve relevant financial data. We created a function, **get_related_tickers**, to find all tickers associated with a given CIK number in our database, ensuring comprehensive data retrieval for each company.

3. Data Preparation

We organised the data into a structured format suitable for analysis and compiled a DataFrame that includes the **CIK number, scandal year, ticker, company name, the ticker at the time the anomaly occurred, and a description of the anomaly.**



### Import Libraries

In [8]:
import pandas as pd

### 1. Identifying Anomalies

In [1]:
# List of companies found in EDGAR and the date of the alleged anomaly
companies_with_anomalies = [
    ("5272", "AIG", "2008", "Financial Misrepresentaion"),
    ("310522", "FNM", "2004", "Accounting Irregularities"),
    ("18926", "Q (Qwest Communications)", "2002", "Accounting Fraud"),
    ("72971", "WFC", "2016", "Consumer Fraud"),
    ("1159508", "DB", "2015", "Wire Fraud"),
    ("1114446", "UBS", "2011", "Tax Evasion"),
    ("14272", "BMY", "2002", "Financial Misrepresentation"),
    ("831001", "C", "2008", "Security Fraud"),
    ("19617", "JPM", "2012", "Trading Fraud"),
    ("1131399", "GSK", "2012", "Health Care Fraud"),
    ("312070", "BARC", "2012", "Involved in the Libor rate-manipulation scandal"),
    ("1119639", "PBR", "2014", "Petrobras"),
    ("96223", "LUK", "2005", "Leucadia National (involved in Refco scandal, now part of Jefferies Financial Group, ticker 'JEF')"),
    ("18926", "LUMN", "2002", "Lumen Technologies (formerly CenturyLink)"),
    ("47217", "HPQ", "2012", "HP Inc. (involved in the aftermath of Autonomy scandal)"),
    ("1490281", "GRPN", "2012", "Groupon (accounting irregularities post-IPO)"),
    ("1119639", "PETR4.SA", "2014", "Petrobras (Brazil)"),
    ("1767582", "LKI", "2020", "Luckin Coffee (China)"),
    ("1467858", "GM", "2014", "General Motors (U.S., ignition switch scandal)"),
    ("789019", "MSFT", "2001", "Microsoft (Accounting investigations)"),
    ("27419", "TGT", "2008", "Target Corporation (Data breach)"),
    ("833444", "JCI", "2005", "Johnson Controls (Bribery scandal)"),
    ("354950", "HD", "2010", "Home Depot (Data breach)"),
    ("1300514", "LVS", "2013", "Las Vegas Sands (Money laundering)"),
    ("320193", "AAPL", "2014", "Apple Inc. (Antitrust investigations)"),
    ("70858", "BAC", "2008", "Bank of America (Mortgage fraud)"),
    ("1318605", "TSLA", "2018", "Tesla Inc. (Elon Musk's misleading tweets)"),
    ("1018724", "AMZN", "2020", "Amazon (Price gouging investigations)"),
    ("1652044", "GOOG", "2019", "Google (Antitrust investigations)"),
    ("50863", "INTC", "2014", "Intel (Antitrust investigations)"),
    ("51143", "IBM", "2005", "IBM (Bribery scandal)"),
    ("77476", "PEP", "2016", "PepsiCo (Bribery scandal)"),
    ("829224", "SBUX", "2018", "Starbucks (Racial bias scandal)"),
    ("40545", "GE", "2009", "General Electric (Accounting fraud)"),
    ("1065280", "NFLX", "2013", "Netflix (Insider trading)"),
    ("2488", "AMD", "2009", "AMD (Antitrust lawsuit)"),
    ("858877", "CSCO", "2013", "Cisco Systems (Whistleblower case)"),
    ("1341439", "ORCL", "2011", "Oracle (Bribery scandal)"),
    ("804328", "QCOM", "2018", "Qualcomm (Antitrust lawsuit)"),
    ("1065088", "EBAY", "2014", "eBay (Data breach)"),
    ("1633917", "PYPL", "2019", "PayPal (Consumer fraud)"),
    ("1512673", "SQ", "2019", "Square (Data breach)"),
    ("1759509", "LYFT", "2019", "Lyft (Labor practices investigation)"),
    ("100493", "TSN", "2019", "Tyson Foods (Labor practices investigation)"),
    ("63754", "MKC", "2018", "McCormick & Company (Price-fixing scandal)"),
    ("91419", "SJM", "2017", "J.M. Smucker Company (Misleading labeling)"),
    ("1637459", "KHC", "2019", "Kraft Heinz (Accounting scandal)"),
    ("1103982", "MDLZ", "2017", "Mondelez International (Price-fixing)"),
    ("60086", "L", "2015", "Loews Corporation (Environmental violations)"),
    ("1041061", "YUM", "2013", "Yum! Brands (Food safety scandal)"),
    ("1286681", "DPZ", "2018", "Domino's Pizza (Wage theft)"),
    ("901491", "PZZA", "2017", "Papa John's (Racial slur controversy)"),
    ("703351", "EAT", "2015", "Brinker International (Labor practices investigation)"),
    ("887596", "CAKE", "2019", "Cheesecake Factory (Wage theft)"),
    ("1013488", "BJRI", "2020", "BJ's Restaurants (Wage theft)"),
    ("1171759", "RRGB", "2018", "Red Robin Gourmet Burgers (Wage theft)"),
    ("1525769", "PLAY", "2019", "Dave & Buster's (Wage theft)")
]
    

In [2]:
# Number of entries in the list (not unique companies: CIKs 18926 and 1119639 each appear twice)
len(companies_with_anomalies)

57
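
Because `len` counts list entries and the same CIK can appear under more than one ticker, the number of distinct companies can be smaller than the entry count. A minimal sketch of counting distinct CIKs, using a three-entry excerpt of the list (the names `sample_entries` and `unique_ciks` are illustrative, not part of the original notebook):

```python
# Excerpt of the list above, just enough to show a repeated CIK.
sample_entries = [
    ("1119639", "PBR", "2014", "Petrobras"),
    ("1119639", "PETR4.SA", "2014", "Petrobras (Brazil)"),
    ("72971", "WFC", "2016", "Consumer Fraud"),
]

# The CIK is the first field of each tuple; a set keeps one copy of each.
unique_ciks = {entry[0] for entry in sample_entries}

print(len(sample_entries))  # number of entries
print(len(unique_ciks))     # number of distinct CIKs
```

Applying the same set comprehension to `companies_with_anomalies` would give the number of distinct CIKs in the full list.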

### 2. Data Retrieval from EDGAR

In [4]:
# Load the Company_Data CSV file
company_data = pd.read_csv('company_data.csv')

# Check column names
print("Columns in company_data:", company_data.columns.tolist())

Columns in company_data: ['Unnamed: 0', 'cik_str', 'ticker', 'title', 'sic', 'sicDescription', 'category', 'entityType', 'exchanges', 'fiscalYearEnd', 'stateOfIncorporation']
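
The `Unnamed: 0` column usually appears when a CSV was written with its DataFrame index; we assume that is what happened here. Under that assumption, passing `index_col=0` to `pd.read_csv` restores it as the index rather than as a data column. The sketch below uses a hypothetical two-row stand-in for `company_data.csv`:

```python
import io
import pandas as pd

# Hypothetical stand-in for company_data.csv, written with its index column.
csv_text = """,cik_str,ticker,title
0,5272,AIG,"AMERICAN INTERNATIONAL GROUP, INC."
1,5272,AIG-PA,"AMERICAN INTERNATIONAL GROUP, INC."
"""

# index_col=0 treats the unnamed first column as the index, not as data.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```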


In [5]:
# Function to retrieve all rows (one per ticker) associated with a given CIK number
def get_related_tickers(cik_str, data):
    # cik_str arrives as a string; the 'cik_str' column holds integers, so cast before comparing
    related_data = data[data['cik_str'] == int(cik_str)]
    return related_data
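
To illustrate the lookup, the sketch below runs the same function on a hypothetical three-row miniature of `company_data` (the real file has more columns, and the `toy_data` name and its values are assumptions made for the example). The function is repeated so the cell runs on its own.

```python
import pandas as pd

def get_related_tickers(cik_str, data):
    # Same logic as the function above, repeated so the example is standalone.
    return data[data['cik_str'] == int(cik_str)]

# Hypothetical miniature of company_data.
toy_data = pd.DataFrame({
    'cik_str': [5272, 5272, 72971],
    'ticker': ['AIG', 'AIG-PA', 'WFC'],
    'title': ['Company A', 'Company A', 'Company B'],
})

matches = get_related_tickers("5272", toy_data)
print(matches['ticker'].tolist())                # tickers sharing CIK 5272
print(get_related_tickers("1", toy_data).empty)  # an unknown CIK yields an empty frame
```

Note that `int(cik_str)` raises a `ValueError` if the CIK string is not numeric, so malformed entries fail loudly rather than silently returning nothing.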

### 3. Data Preparation

In [6]:
# Define the anomaly data list
anomaly_data = []

# Loop to populate anomaly data: one row per ticker that shares the entry's CIK.
# Entries whose CIK is not found in company_data are skipped.
for cik_str, anomaly_ticker, scandal_year, detail in companies_with_anomalies:
    related_tickers_data = get_related_tickers(cik_str, company_data)
    if not related_tickers_data.empty:
        for _, row in related_tickers_data.iterrows():
            anomaly_data.append({
                'cik_str': cik_str,
                'Scandal Year': scandal_year,
                'Ticker': row['ticker'],
                'Company Name': row['title'],
                'Anomaly Ticker': anomaly_ticker,
                'Detail': detail
            })

# Convert the list of dictionaries into a DataFrame
anomaly_df = pd.DataFrame(anomaly_data)

anomaly_df

| | cik_str | Scandal Year | Ticker | Company Name | Anomaly Ticker | Detail |
|---|---|---|---|---|---|---|
| 0 | 5272 | 2008 | AIG | AMERICAN INTERNATIONAL GROUP, INC. | AIG | Financial Misrepresentaion |
| 1 | 5272 | 2008 | AIG-PA | AMERICAN INTERNATIONAL GROUP, INC. | AIG | Financial Misrepresentaion |
| 2 | 310522 | 2004 | FNMA | FEDERAL NATIONAL MORTGAGE ASSOCIATION FANNIE MAE | FNM | Accounting Irregularities |
| 3 | 310522 | 2004 | FNMAS | FEDERAL NATIONAL MORTGAGE ASSOCIATION FANNIE MAE | FNM | Accounting Irregularities |
| 4 | 310522 | 2004 | FNMAH | FEDERAL NATIONAL MORTGAGE ASSOCIATION FANNIE MAE | FNM | Accounting Irregularities |
| ... | ... | ... | ... | ... | ... | ... |
| 168 | 703351 | 2015 | EAT | BRINKER INTERNATIONAL, INC | EAT | Brinker International (Labor practices investi... |
| 169 | 887596 | 2019 | CAKE | CHEESECAKE FACTORY INC | CAKE | Cheesecake Factory (Wage theft) |
| 170 | 1013488 | 2020 | BJRI | BJs RESTAURANTS INC | BJRI | BJ's Restaurants (Wage theft) |
| 171 | 1171759 | 2018 | RRGB | RED ROBIN GOURMET BURGERS INC | RRGB | Red Robin Gourmet Burgers (Wage theft) |
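
The loop drops any entry whose CIK has no match in `company_data`, so it can be useful to see which entries were lost. One possible way to check this, sketched here as an alternative to the loop rather than as the notebook's own method, is a left join with `indicator=True`. The inputs below are hypothetical miniatures (`toy_entries`, `toy_company_data`), including a placeholder entry that deliberately has no match:

```python
import pandas as pd

# Hypothetical miniatures of the two inputs used above.
toy_entries = [
    ("5272", "AIG", "2008", "Example detail"),
    ("999999999", "XXXX", "2000", "Placeholder with no match"),
]
toy_company_data = pd.DataFrame({
    'cik_str': [5272, 5272],
    'ticker': ['AIG', 'AIG-PA'],
    'title': ['Company A', 'Company A'],
})

anomalies = pd.DataFrame(
    toy_entries, columns=['cik_str', 'Anomaly Ticker', 'Scandal Year', 'Detail']
)
# Match the integer dtype of the company table's cik_str before joining.
anomalies['cik_str'] = anomalies['cik_str'].astype(int)

# A left join keeps entries with no match; indicator=True adds a _merge column.
merged = anomalies.merge(
    toy_company_data[['cik_str', 'ticker', 'title']],
    on='cik_str', how='left', indicator=True,
)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Anomaly Ticker'].tolist()
print(unmatched)
```

Rows flagged `left_only` correspond to the entries the loop would have skipped.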


In [10]:
# Save DataFrame to a CSV file
anomaly_df.to_csv('companies_with_anomalies.csv', index=False)


### Next Steps

We plan to query the tickers from our DataFrame against the EDGAR database to retrieve specific financial statements from the anomaly years. Each financial statement retrieved will be reviewed to validate the anomalies and then used for training our model.
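
The notebook does not say which EDGAR endpoint will be used for this step. As one possibility, the SEC's XBRL "company facts" API is keyed by CIK zero-padded to 10 digits; the sketch below only builds such a URL and makes no network request. The helper name `company_facts_url` is hypothetical.

```python
def company_facts_url(cik_str):
    # SEC data APIs expect the CIK zero-padded to 10 digits.
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{int(cik_str):010d}.json"

print(company_facts_url("320193"))
```

If this endpoint is used, automated requests are expected to identify themselves with a descriptive `User-Agent` header under the SEC's fair-access guidance.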
