### Mount Google Drive

This cell mounts your Google Drive to the Colab environment. This allows the notebook to access files stored in your Google Drive, such as the `txt_converted` folder and `tailored_metadata.csv`.

In [10]:
from google.colab import drive
drive.mount('/content/drive/')

import os

# List everything in your main Drive folder
for item in os.listdir('/content/drive/MyDrive/'):
    print(item)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
Colab Notebooks
Namnlös mapp (1)
Namnlöst dokument (25).gdoc
Namnlöst dokument - Linjediagram 1.gsheet
Namnlöst dokument (24).gdoc
presentation.gdoc
Usa´s Statsskick.gdoc
Namnlöst dokument (23).gdoc
Argument.gdoc
svar på Diana- dyrkad gudinna - Asma osma TE17.gdoc
Beskrivning av linjär funktion med ord- Asma osman Te17.gdoc
Kristina-Asma.gdoc
Book Review.gdoc
Asma- genomgång inför prov .gdoc
Kejsarn-Asma.gdoc
plaster och gummi- Asma .gdoc
 KEMI- genomgån inför prov .gdoc
 Sop Asma.gdoc
Ordning och reda .gdoc
Nuturekonomi .gdoc
Integratio & Religion, Debattartikel (asma) (1).gdoc
Ergnomi ahmed te -17.gdoc
Asma te17- Media Loga .gdoc
Bok - Asma.gdoc
Asma Te17 Vetenskap & religion .gdoc
Asma osman Te 17-Integration & Religion  .gdoc
kom i tid.gdoc
Religion.gdoc
käll.gdoc
Asma _Etik & Moral:.gdoc
Asma TE17- Ordning & reda.gdoc
Finding home.gdoc
DATE


### Import Libraries

This cell imports necessary Python libraries:
- `pandas` as `pd`: For data manipulation and analysis.
- `os`: For interacting with the operating system, like listing directory contents.
- `re`: For regular expressions, used here for pattern matching in filenames.
- `unicodedata`: For handling Unicode characters, particularly useful for normalizing text with diacritics.

In [12]:

import pandas as pd
import os
import re
import unicodedata

### Extract Company Names and Years from Filenames

This cell processes the filenames in the `txt_converted` folder. It extracts the company name and publishing year from each `.txt` filename, assuming the base name ends in a four-digit year (for example `PostNL_2017.txt`); files without a trailing year are skipped. The results are collected in a Pandas DataFrame (`file_df`), which is later merged with the metadata.

In [13]:
# Path to your txt_converted folder
folder_path = '/content/drive/MyDrive/Dav_project/txt_converted'
files = os.listdir(folder_path)

company_names = []
publishing_years = []
filenames = []

for filename in files:
    if not filename.endswith('.txt'):
        continue

    base_name = filename[:-4]
    match = re.search(r'(\d{4})$', base_name)

    if match:
        year = int(match.group(1))
        company_name = base_name[:match.start()].strip()

        if company_name.endswith('_') or company_name.endswith('-'):
            company_name = company_name[:-1].strip()

        company_names.append(company_name)
        publishing_years.append(year)
        filenames.append(filename)

file_df = pd.DataFrame({
    'filename': filenames,
    'company_name': company_names,
    'year': publishing_years
})

print(f"Total text files: {len(file_df)}")
print(file_df.head())

Total text files: 5790
                 filename   company_name  year
0  POSCOCHEMTECH_2017.txt  POSCOCHEMTECH  2017
1     PowerChina_2015.txt     PowerChina  2015
2         PostNL_2017.txt         PostNL  2017
3     PowerChina_2016.txt     PowerChina  2016
4     PowerChina_2017.txt     PowerChina  2017
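
The parsing rule described above can be tried on individual names in isolation. The following is a minimal sketch, not the notebook's own code: `parse_report_filename` is a hypothetical helper, and it strips all trailing `_`, `-` and space characters, which is slightly more lenient than the single-character strip in the cell.

```python
import re

def parse_report_filename(filename):
    """Split 'Company_2017.txt' into ('Company', 2017).

    Returns None when the name is not a .txt file or does not end in a
    four-digit year. Hypothetical helper for illustration only.
    """
    if not filename.endswith('.txt'):
        return None
    base_name = filename[:-4]
    match = re.search(r'(\d{4})$', base_name)
    if match is None:
        return None
    # Drop separators left between the company name and the year
    company = base_name[:match.start()].rstrip('_- ').strip()
    return company, int(match.group(1))

print(parse_report_filename('PostNL_2017.txt'))
print(parse_report_filename('Trygg-HansaFörsäkrings_2017.txt'))
print(parse_report_filename('notes.txt'))
```

Returning `None` for names that do not fit the pattern makes the skipped files easy to collect and inspect, instead of dropping them silently.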


### Load Metadata and Initial Matching

This cell loads the `tailored_metadata.csv` file, which contains additional information about the reports. It then performs an initial merge between the `file_df` (created in the previous step) and the `metadata` DataFrame, using the `filename` column. This step identifies how many files from the `txt_converted` folder successfully match with entries in the metadata.

In [14]:
# Load metadata
metadata = pd.read_csv('/content/drive/MyDrive/Dav_project/tailored_metadata.csv')
print(f"Metadata columns: {metadata.columns.tolist()}")

# Merge on filename
matched = pd.merge(
    file_df,
    metadata,
    left_on='filename',
    right_on='file_full_name',
    how='left'
)

matched_count = matched['Name'].notna().sum()
print(f"Matched: {matched_count} out of {len(file_df)}")
print(f"Match rate: {matched_count/len(file_df)*100:.1f}%")

Metadata columns: ['Name', 'Year', 'file', 'Organization_type', 'Size', 'Sector', 'Sec_SASB', 'Country', 'Region', 'OECD', 'english_non_english', 'file_full_name']
Matched: 5650 out of 5790
Match rate: 97.6%
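
The match count above relies on `Name` being non-null, which would also count a metadata row with a missing `Name` as unmatched. A possible alternative, sketched here on toy data (the frames `toy_files` and `toy_meta` are illustrative stand-ins, not the real files), is to let `merge` report the match status itself through its `indicator` parameter.

```python
import pandas as pd

# Toy stand-ins for file_df and metadata (values are illustrative only)
toy_files = pd.DataFrame({'filename': ['A_2015.txt', 'B_2016.txt', 'C_2017.txt']})
toy_meta = pd.DataFrame({
    'file_full_name': ['A_2015.txt', 'B_2016.txt'],
    'Name': ['A', 'B'],
})

toy_matched = toy_files.merge(
    toy_meta,
    left_on='filename',
    right_on='file_full_name',
    how='left',
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
    validate='many_to_one',  # raises MergeError if the metadata keys are duplicated
)

n_matched = int((toy_matched['_merge'] == 'both').sum())
n_unmatched = int((toy_matched['_merge'] == 'left_only').sum())
print(f"Matched: {n_matched}, unmatched: {n_unmatched}")
```

The `validate` argument is optional; it guards against duplicate `file_full_name` entries silently inflating the row count of the merged frame.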


### Identify Unmatched Files and Analyze Diacritics

After the initial merge, this cell identifies and inspects the files that did not find a match in the metadata. It specifically checks for the presence of diacritics (accents or special characters) in the filenames, as these can often cause mismatch issues. It prints a list of unmatched files and a count of those containing diacritics.

In [15]:
# After the initial merge, check what didn't match
unmatched = matched[matched['Name'].isna()].copy()
print(f"\n{'='*60}")
print(f"UNMATCHED FILES: {len(unmatched)}")
print(f"{'='*60}")

if len(unmatched) > 0:
    print("\nFirst 20 unmatched filenames:")
    for idx, row in unmatched.head(20).iterrows():
        print(f"  {row['filename']}")

    print("\nSample of unmatched company names extracted from filenames:")
    print(unmatched['company_name'].value_counts().head(15))

    # Check if these might be Unicode issues
    # (unicodedata is already imported in the Import Libraries cell)
    def has_diacritics(text):
        if pd.isna(text):
            return False
        text = str(text)
        normalized = unicodedata.normalize('NFD', text)
        for c in normalized:
            if unicodedata.combining(c):
                return True
        return False

    unmatched['has_diacritics'] = unmatched['filename'].apply(has_diacritics)
    diacritic_count = unmatched['has_diacritics'].sum()
    print(f"\nFiles with diacritics/accents: {diacritic_count}")

    if diacritic_count > 0:
        print("\nExamples of files with diacritics:")
        print(unmatched[unmatched['has_diacritics']]['filename'].head(10).tolist())
else:
    print("✓ All files matched successfully!")


UNMATCHED FILES: 140

First 20 unmatched filenames:
  SonaeIndústria_2015.txt
  VereinigungderÖsterreichischenZementindustrie(VÖZ)_2016.txt
  TelefónicaMovistarEcuador_2014.txt
  TeleféricoSanBernardoSalta_2014.txt
  TÜVRheinlandAG_2015.txt
  UnimeddeCascavel-CooperativadeTrabalhoMédico_2014.txt
  Feldschlösschen_2017.txt
  TIGÁZ_2013.txt
  Trygg-HansaFörsäkrings_2017.txt
  MillenniumBancoComercialPortuguês_2016.txt
  TÜVRheinlandAG_2014.txt
  MillenniumBancoComercialPortuguês_2014.txt
  MillenniumBancoComercialPortuguês_2013.txt
  MineraçãoRiodoNorte_2013.txt
  TÜVRheinlandAG_2016.txt
  TÜVRheinlandAG_2013.txt
  SecretaríaGeneraldelaGobernación-GobiernodelaProvinciadeCórdoba_2014.txt
  OrangeRomânia_2006.txt
  OrangeRomânia_2007.txt
  OrganizaciónCorona_2010.txt

Sample of unmatched company names extracted from filenames:
company_name
DeutscheBörseAG                                            5
TÜVRheinlandAG                                             4
GP

### Re-match Files Using Unicode Normalization

This cell is meant to resolve mismatches caused by Unicode characters by adding files recovered through Unicode normalization to the `matched` DataFrame. It depends on a `missing_found` list of recovered filenames, which is not defined in any cell shown in this notebook. In the run shown here `missing_found` is absent or empty, so the cell adds nothing and prints "No files were found in Unicode matching", leaving the 140 unmatched files unmatched.

In [16]:
print(f"\n{'='*60}")
print(f"ADDING NEWLY MATCHED FILES")
print(f"{'='*60}")

if 'missing_found' in locals() and len(missing_found) > 0:
    print(f"\nFound {len(missing_found)} files via Unicode normalization:")
    print("-" * 40)

    # Show the files being added
    for i, f in enumerate(missing_found[:20]):  # Show first 20
        print(f"  ✅ Adding: {f}")
    if len(missing_found) > 20:
        print(f"  ... and {len(missing_found) - 20} more")

    # Get metadata for found files
    found_metadata = metadata[metadata['file_full_name'].isin(missing_found)].copy()
    found_metadata['filename'] = found_metadata['file_full_name']

    # Get file_df entries
    found_files = file_df[file_df['filename'].isin(missing_found)].copy()

    # Merge
    found_merged = pd.merge(
        found_files,
        found_metadata,
        on='filename',
        how='left'
    )

    # Separate already matched
    already_matched = matched[matched['Name'].notna()].copy()

    # Store counts before
    before_count = len(already_matched)

    # Combine
    matched = pd.concat([already_matched, found_merged], ignore_index=True)

    # Remove any duplicates
    if matched.duplicated(subset=['filename']).any():
        dup_count = matched.duplicated(subset=['filename']).sum()
        print(f"\n⚠️ Found {dup_count} duplicates, removing...")
        matched = matched.drop_duplicates(subset=['filename'])

    print(f"\n{'='*40}")
    print(f"RESULTS:")
    print(f"{'='*40}")
    print(f"Previously matched: {before_count}")
    print(f"Newly added: {len(found_merged)}")
    print(f"Total matched now: {len(matched)}")

    # Verify we have all files
    all_files_count = len(file_df)
    matched_files_count = matched['filename'].nunique()

    if matched_files_count == all_files_count:
        print(f"\n✅ SUCCESS: All {all_files_count} files are now matched!")
    else:
        print(f"\n⚠️ Still missing: {all_files_count - matched_files_count} files")

    # Show if any still missing metadata
    still_no_metadata = matched[matched['Name'].isna()]
    if len(still_no_metadata) > 0:
        print(f"\nFiles without metadata (will be removed later): {len(still_no_metadata)}")
        print("First few:")
        for idx, row in still_no_metadata.head(5).iterrows():
            print(f"  ❌ {row['filename']}")
else:
    print("No files were found in Unicode matching")

print(f"\n{'='*60}")


ADDING NEWLY MATCHED FILES
No files were found in Unicode matching
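
The cell above expects a `missing_found` list, but no cell shown here builds one. The following is a minimal sketch of one way such a list could be derived, under the assumption (not confirmed by the notebook) that the unmatched names differ from the metadata only in Unicode normalization form, for example decomposed (NFD) names on disk versus precomposed (NFC) names in the CSV. The sample names and the helper `nfc` are illustrative.

```python
import unicodedata

def nfc(text):
    """Return text in Unicode NFC form (precomposed characters)."""
    return unicodedata.normalize('NFC', text)

# Illustrative names: the same visible filename in two encodings
on_disk = ['TU\u0308VRheinlandAG_2015.txt', 'PostNL_2017.txt']     # 'U' + combining diaeresis (NFD)
in_metadata = ['T\u00dcVRheinlandAG_2015.txt', 'PostNL_2017.txt']  # precomposed 'Ü' (NFC)

# Names that already match byte for byte
exact = set(on_disk) & set(in_metadata)

# Map each normalized metadata name back to its original spelling
meta_by_key = {nfc(name): name for name in in_metadata}

# Files that fail the exact match but match after normalization
recovered = {
    name: meta_by_key[nfc(name)]
    for name in on_disk
    if name not in exact and nfc(name) in meta_by_key
}
print(recovered)
```

With the real data, the same idea would mean adding a normalized key column to both `file_df` and `metadata` and merging on that key rather than on the raw names. If the names differ in other ways (spacing, punctuation, transliteration), this approach would not recover them.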



### Check for and Remove Empty Files

This cell checks each file listed in `matched` and flags those that are empty (0 bytes). Because `is_truly_empty` returns `True` when the size check raises an exception, files that are missing or unreadable are also flagged as empty. Flagged files are removed from the dataset, since they contain no text to analyze, and the result is stored in `final_data`. The cell also breaks the flagged files down by sector and country to reveal any patterns.

In [19]:
print(f"\n{'='*60}")
print(f"CHECKING FOR EMPTY FILES (0 BYTES)")
print(f"{'='*60}")

folder_path = '/content/drive/MyDrive/Dav_project/txt_converted'

def is_truly_empty(filename):
    filepath = os.path.join(folder_path, filename)
    try:
        return os.path.getsize(filepath) == 0
    except Exception as e:
        print(f"Error checking {filename}: {e}")
        return True

# Check all matched files
matched['file_empty'] = matched['filename'].apply(is_truly_empty)
empty_count = matched['file_empty'].sum()
print(f"\nTotal files checked: {len(matched)}")
print(f"Truly empty files (0 bytes): {empty_count}")

# Show empty files if any
if empty_count > 0:
    print(f"\n{'='*40}")
    print(f"EMPTY FILES TO BE REMOVED:")
    print(f"{'='*40}")
    empty_files = matched[matched['file_empty']].copy()

    # Show the first 20 empty files with their metadata name
    print("\nFirst 20 empty files:")
    for idx, row in empty_files.head(20).iterrows():
        print(f"  ❌ {row['filename']} - {row.get('Name', 'Unknown')}")

    if empty_count > 20:
        print(f"  ... and {empty_count - 20} more")

    # Check if certain sectors/countries have more empty files
    if 'Sector' in empty_files.columns:
        print(f"\nEmpty files by sector:")
        print(empty_files['Sector'].value_counts().head(10))

    if 'Country' in empty_files.columns:
        print(f"\nEmpty files by country:")
        print(empty_files['Country'].value_counts().head(10))

    # Remove empty files
    final_data = matched[~matched['file_empty']].copy()
    print(f"\n{'='*40}")
    print(f"AFTER REMOVING EMPTY FILES:")
    print(f"{'='*40}")
    print(f"Files kept: {len(final_data)}")
    print(f"Files removed: {empty_count}")

else:
    print(f"\n✅ No empty files found!")
    final_data = matched.copy()

print(f"\nFinal dataset size: {len(final_data)} reports")


CHECKING FOR EMPTY FILES (0 BYTES)

Total files checked: 5790
Truly empty files (0 bytes): 108

EMPTY FILES TO BE REMOVED:

First 20 empty files:
  ❌ QatarInsuranceCompany_2017.txt - Qatar Insurance Company
  ❌ VTRGlobalCom_2015.txt - VTR GlobalCom
  ❌ WindHellasTelecommunication_2012.txt - Wind Hellas Telecommunication
  ❌ WpgHoldingsLimited_2016.txt - Wpg Holdings Limited
  ❌ Sustainalytics_2017.txt - Sustainalytics
  ❌ SporveienOslo_2017.txt - Sporveien Oslo
  ❌ TaiwanSugarCorporation_2015.txt - Taiwan Sugar Corporation
  ❌ TrueCorporation_2017.txt - True Corporation
  ❌ ThaiOpticalGroup_2016.txt - Thai Optical Group
  ❌ TheSaudiInvestmentBank(SAIB)_2015.txt - The Saudi Investment Bank (SAIB)
  ❌ TimeWarnerCable_2015.txt - Time Warner Cable
  ❌ TopviewOptronicsCorporation_2017.txt - Topview Optronics Corporation
  ❌ SamsungSecurities_2017.txt - Samsung Securities
  ❌ SecretaríaGeneraldelaGobernación-GobiernodelaProvinciadeCórdoba_2014.txt - nan
  ❌ AviationIndustryCorporationofC
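
Since `is_truly_empty` treats any error as "empty", a missing file and a 0-byte file are indistinguishable in the counts above. A minimal sketch of keeping the two cases apart follows; `file_status` is a hypothetical helper and the files are created in a temporary folder purely for demonstration.

```python
import os
import tempfile

def file_status(folder, filename):
    """Classify a file as 'missing', 'empty' (0 bytes) or 'ok'."""
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        return 'missing'
    return 'empty' if os.path.getsize(path) == 0 else 'ok'

with tempfile.TemporaryDirectory() as tmp:
    # One 0-byte file and one file with content; 'absent.txt' is never created
    open(os.path.join(tmp, 'empty.txt'), 'w').close()
    with open(os.path.join(tmp, 'full.txt'), 'w', encoding='utf-8') as fh:
        fh.write('report text')

    statuses = {
        name: file_status(tmp, name)
        for name in ['empty.txt', 'full.txt', 'absent.txt']
    }

print(statuses)
```

Applied to `matched['filename']`, a status column like this would show whether the removed rows are genuinely empty conversions or files that could not be found.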

### Save Final Dataset and Summary

This cell saves the cleaned and filtered dataset (`final_data`) to a new CSV file named `final_reports_clean.csv` in your Google Drive. It then prints a summary of the final dataset, including the total number of reports, unique companies, sectors, countries, and the distribution of English vs. non-English reports.

In [21]:
print(f"\n{'='*60}")
print(f"SAVING FINAL DATASET")
print(f"{'='*60}")

final_data.to_csv('/content/drive/MyDrive/Dav_project/final_reports_clean.csv', index=False)


# Final summary
print(f"\n{'='*60}")
print(f"FINAL DATASET SUMMARY")
print(f"{'='*60}")
print(f"Total reports: {len(final_data)}")
print(f"Unique companies: {final_data['Name'].nunique()}")
print(f"Unique sectors: {final_data['Sector'].nunique()}")
print(f"Unique countries: {final_data['Country'].nunique()}")
print(f"\nEnglish reports: {final_data['english_non_english'].value_counts().get('english', 0)}")
print(f"Non-English reports: {final_data['english_non_english'].value_counts().get('non-english', 0)}")


SAVING FINAL DATASET

FINAL DATASET SUMMARY
Total reports: 5682
Unique companies: 2878
Unique sectors: 38
Unique countries: 87

English reports: 1632
Non-English reports: 3918


### List All Sectors and Countries

This cell prints every unique sector and country in the `final_data` DataFrame together with its report count, followed by a short summary. The code contains a `# Save to file for easy viewing` comment, but no save step follows it, so the lists are not written to `all_sectors_list.csv` or `all_countries_list.csv` here.

In [23]:
print(f"\n{'='*60}")
print(f"ALL SECTORS - COMPLETE LIST")
print(f"{'='*60}")

# Get all sectors with counts
all_sectors = final_data['Sector'].value_counts().sort_values(ascending=False)

# Print each sector with count
for sector, count in all_sectors.items():
    print(f"{sector}: {count}")

# Save to file for easy viewing


print(f"\n{'='*60}")
print(f"ALL COUNTRIES - COMPLETE LIST")
print(f"{'='*60}")

# Get all countries with counts
all_countries = final_data['Country'].value_counts().sort_values(ascending=False)

# Print each country with count
for country, count in all_countries.items():
    print(f"{country}: {count}")



print(f"\n{'='*60}")
print(f"SUMMARY")
print(f"{'='*60}")
print(f"Total sectors: {len(all_sectors)}")
print(f"Total countries: {len(all_countries)}")
print(f"Total reports: {len(final_data)}")


ALL SECTORS - COMPLETE LIST
Financial Services: 581
Other: 558
Food and Beverage Products: 358
Chemicals: 324
Energy: 308
Equipment: 289
Technology Hardware: 272
Real Estate: 200
Metals Products: 184
Construction: 177
Healthcare Products: 162
Energy Utilities: 158
Automotive: 154
Conglomerates: 152
Telecommunications: 151
Retailers: 141
Construction Materials: 121
Logistics: 121
Mining: 119
Non-Profit / Services: 99
Commercial Services: 93
Tourism/Leisure: 76
Consumer Durables: 73
Computers: 73
Aviation: 67
Textiles and Apparel: 67
Household and Personal Products: 62
Water Utilities: 53
Healthcare Services: 52
Railroad: 51
Forest and Paper Products: 51
Agriculture: 47
Public Agency: 46
Media: 45
Universities: 29
Waste Management: 22
Toys: 10
Tobacco: 4

ALL COUNTRIES - COMPLETE LIST
Mainland China: 1407
Japan: 1097
Taiwan: 670
United States of America: 320
Germany: 118
India: 109
Hong Kong: 90
Canada: 88
Colombia: 81
Greece: 80
Russian Federation: 80
Finland: 80
Spain: 76
United Kingd
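
The cell above prints the counts but has no save step after its `# Save to file` comment. A minimal sketch of writing such lists to CSV is shown below on toy data. The helper `save_counts`, the `report_count` column name and the temporary output folder are assumptions made for illustration; only the two file names come from the notebook text.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for final_data (values are illustrative only)
toy = pd.DataFrame({
    'Sector': ['Energy', 'Energy', 'Mining'],
    'Country': ['Japan', 'India', 'Japan'],
})

def save_counts(df, column, path):
    """Count the values in `column` and write them to `path` as a two-column CSV."""
    counts = (
        df[column]
        .value_counts()
        .rename_axis(column)
        .reset_index(name='report_count')
    )
    counts.to_csv(path, index=False)
    return counts

out_dir = tempfile.mkdtemp()  # placeholder; a Drive folder would be used in Colab
sector_counts = save_counts(toy, 'Sector', os.path.join(out_dir, 'all_sectors_list.csv'))
country_counts = save_counts(toy, 'Country', os.path.join(out_dir, 'all_countries_list.csv'))
print(sector_counts)
```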

### Analyze Sectors and Countries by Language

This cell cross-tabulates `Sector` and `Country` against the `english_non_english` column, showing how many English and non-English reports each sector and country has. Although the printed headings say "Top 20", `head(20)` returns the first 20 rows of each table in index order (alphabetical), not the 20 largest sectors or countries.

In [24]:
print(f"\n{'='*60}")
print(f"SECTORS BY LANGUAGE (Top 20)")
print(f"{'='*60}")

# Cross-tab of sectors by language
sector_lang = pd.crosstab(final_data['Sector'], final_data['english_non_english'])
print(sector_lang.head(20))

print(f"\n{'='*60}")
print(f"COUNTRIES BY LANGUAGE (Top 20)")
print(f"{'='*60}")

# Cross-tab of countries by language
country_lang = pd.crosstab(final_data['Country'], final_data['english_non_english'])
print(country_lang.head(20))


SECTORS BY LANGUAGE (Top 20)
english_non_english              english  non-english
Sector                                               
Agriculture                           13           34
Automotive                            49          105
Aviation                              24           43
Chemicals                             64          260
Commercial Services                   46           47
Computers                             23           50
Conglomerates                         47          105
Construction                          43          134
Construction Materials                30           91
Consumer Durables                     28           45
Energy                               102          206
Energy Utilities                      53          105
Equipment                             79          210
Financial Services                   177          404
Food and Beverage Products            91          267
Forest and Paper Products             27           2
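
The table above is in alphabetical order because `head()` takes rows in index order. If a ranking by size is wanted, the cross-tab can be sorted by its row totals first. A minimal sketch on toy data follows; the `total` column and the toy values are illustrative.

```python
import pandas as pd

# Toy stand-in for final_data (values are illustrative only)
toy = pd.DataFrame({
    'Sector': ['Energy', 'Energy', 'Energy', 'Agriculture', 'Mining', 'Mining'],
    'english_non_english': ['english', 'non-english', 'non-english',
                            'english', 'english', 'non-english'],
})

sector_lang = pd.crosstab(toy['Sector'], toy['english_non_english'])

# Add a row total, then rank by it before taking the head
sector_lang['total'] = sector_lang.sum(axis=1)
top_sectors_by_total = sector_lang.sort_values('total', ascending=False).head(2)
print(top_sectors_by_total)
```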

In [25]:
print(f"\n{'='*60}")
print(f"TOP 5 SECTORS - REPORTS PER YEAR")
print(f"{'='*60}")

# Get top 5 sectors
top_sectors = final_data['Sector'].value_counts().head(5).index.tolist()

# Filter to top sectors
top_sectors_data = final_data[final_data['Sector'].isin(top_sectors)]

# Pivot table: years vs sectors
year_sector = pd.crosstab(top_sectors_data['Year'], top_sectors_data['Sector'])
print(year_sector)


TOP 5 SECTORS - REPORTS PER YEAR
Sector  Chemicals  Energy  Financial Services  Food and Beverage Products  \
Year                                                                        
2000.0          0       0                   1                           0   
2002.0          0       0                   1                           0   
2003.0          0       1                   0                           1   
2004.0          0       0                   1                           2   
2005.0          1       0                   3                           1   
2006.0          6       4                   1                           3   
2007.0          9       7                   2                           5   
2008.0          9       9                  10                           7   
2009.0         10       6                  12                           6   
2010.0         12      14                  21                          12   
2011.0         10      18                 

In [26]:
# Find companies with reports in both periods
pre_years = [2005, 2006, 2007]
post_years = [2009, 2010, 2011]

pre_companies = set(final_data[final_data['Year'].isin(pre_years)]['Name'])
post_companies = set(final_data[final_data['Year'].isin(post_years)]['Name'])

survivors = pre_companies.intersection(post_companies)
print(f"Companies that reported before AND after crisis: {len(survivors)}")

Companies that reported before AND after crisis: 52


In [27]:
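# Check English vs Non-English for your target sectors.
# Note: 'Utilities' matches no sector in this dataset (the actual names are
# 'Energy Utilities' and 'Water Utilities'), so it contributes no rows below.
sectors_of_interest = ['Financial Services', 'Energy', 'Mining', 'Utilities', 'Construction',
                       'Food and Beverage Products', 'Retailers']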

# Filter to your sectors
target_data = final_data[final_data['Sector'].isin(sectors_of_interest)]

# English breakdown by sector
print("ENGLISH REPORTS BY SECTOR")
print("-" * 50)
eng_by_sector = target_data[target_data['english_non_english'] == 'english']['Sector'].value_counts()
print(eng_by_sector)

print("\nNON-ENGLISH REPORTS BY SECTOR")
print("-" * 50)
noneng_by_sector = target_data[target_data['english_non_english'] == 'non-english']['Sector'].value_counts()
print(noneng_by_sector)

# Total English in your target sectors
total_eng = target_data[target_data['english_non_english'] == 'english'].shape[0]
print(f"\nTOTAL ENGLISH REPORTS IN TARGET SECTORS: {total_eng}")

ENGLISH REPORTS BY SECTOR
--------------------------------------------------
Sector
Financial Services            177
Energy                        102
Food and Beverage Products     91
Retailers                      45
Construction                   43
Mining                         42
Name: count, dtype: int64

NON-ENGLISH REPORTS BY SECTOR
--------------------------------------------------
Sector
Financial Services            404
Food and Beverage Products    267
Energy                        206
Construction                  134
Retailers                      96
Mining                         77
Name: count, dtype: int64

TOTAL ENGLISH REPORTS IN TARGET SECTORS: 500
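
Only six of the seven requested sectors appear in the output above: `'Utilities'` is not a sector name in this dataset, and `isin` ignores unknown values without warning. A small check like the following sketch can catch such names before filtering. The `available` set is a hand-copied subset of the sector list printed earlier, used here in place of `final_data['Sector'].unique()`.

```python
sectors_of_interest = ['Financial Services', 'Energy', 'Mining', 'Utilities',
                       'Construction', 'Food and Beverage Products', 'Retailers']

# Subset of the sector names printed earlier in the notebook
available = {
    'Financial Services', 'Energy', 'Mining', 'Energy Utilities',
    'Water Utilities', 'Construction', 'Food and Beverage Products', 'Retailers',
}

# Requested names that do not exist in the data
not_found = [s for s in sectors_of_interest if s not in available]
print(f"Requested sectors not present in the data: {not_found}")
```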


In [28]:
# See all English counts
eng_counts = final_data[final_data['english_non_english'] == 'english']['Sector'].value_counts()
print(eng_counts.head(20))


Sector
Financial Services                 177
Other                              121
Energy                             102
Food and Beverage Products          91
Equipment                           79
Technology Hardware                 75
Real Estate                         69
Chemicals                           64
Energy Utilities                    53
Healthcare Products                 52
Automotive                          49
Conglomerates                       47
Telecommunications                  46
Commercial Services                 46
Logistics                           45
Retailers                           45
Construction                        43
Mining                              42
Metals Products                     42
Household and Personal Products     35
Name: count, dtype: int64


In [30]:
# Define your three groups
industrial = ['Energy', 'Energy Utilities', 'Mining', 'Metals Products',
              'Chemicals', 'Construction', 'Automotive', 'Equipment']

financial = ['Financial Services', 'Real Estate', 'Commercial Services',
             'Logistics', 'Conglomerates']

consumer = ['Food and Beverage Products', 'Retailers', 'Technology Hardware',
            'Telecommunications', 'Healthcare Products', 'Household and Personal Products']




# Create group labels
final_data['Group'] = 'Other'  # default
final_data.loc[final_data['Sector'].isin(industrial), 'Group'] = 'Industrial'
final_data.loc[final_data['Sector'].isin(financial), 'Group'] = 'Financial'
final_data.loc[final_data['Sector'].isin(consumer), 'Group'] = 'Consumer'

# Filter to only these groups and English
analysis_data = final_data[
    (final_data['Group'] != 'Other') &
    (final_data['english_non_english'] == 'english')
].copy()

print(f"Final analysis dataset: {len(analysis_data)} English reports")
print(analysis_data['Group'].value_counts())

Final analysis dataset: 1202 English reports
Group
Industrial    474
Financial     384
Consumer      344
Name: count, dtype: int64
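
The three `.loc` assignments above can also be expressed as a single lookup table, which keeps the sector-to-group definition in one place if it is needed again later. This is a minimal sketch on toy data with abbreviated sector lists, not a replacement for the cell; `sector_groups` and `sector_to_group` are illustrative names.

```python
import pandas as pd

# Abbreviated group definitions (the notebook's lists are longer)
sector_groups = {
    'Industrial': ['Energy', 'Mining'],
    'Financial': ['Financial Services', 'Real Estate'],
    'Consumer': ['Retailers'],
}

# Invert to a sector -> group lookup
sector_to_group = {
    sector: group
    for group, sectors in sector_groups.items()
    for sector in sectors
}

toy = pd.DataFrame({'Sector': ['Energy', 'Retailers', 'Toys', 'Real Estate']})

# Sectors missing from the lookup fall back to 'Other'
toy['Group'] = toy['Sector'].map(sector_to_group).fillna('Other')
print(toy)
```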


In [31]:
print(f"Total English reports: {len(analysis_data)}")
print(f"Year range: {analysis_data['Year'].min()} - {analysis_data['Year'].max()}")

Total English reports: 1202
Year range: 2002.0 - 2018.0
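
The years print as `2002.0` because the `Year` column is stored as floating point, most likely because it contains missing values (an assumption; the notebook does not show the dtype). A minimal sketch of converting such a column to pandas' nullable integer type, on a toy Series:

```python
import pandas as pd

# Toy stand-in for the Year column, including one missing value
years = pd.Series([2002.0, 2016.0, None], name='Year')

# 'Int64' (capital I) is the nullable integer dtype; missing values become <NA>
years_int = years.astype('Int64')
print(f"Year range: {years_int.min()} - {years_int.max()}")
```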


In [32]:
# Filter to Era 1: 2000-2008
era1 = analysis_data[analysis_data['Year'].between(2000, 2008)].copy()

print("ERA 1 (2000-2008)")
print("=" * 50)
print(f"Total English reports: {len(era1)}")
print("\nBy group:")
print(era1['Group'].value_counts())
print("\nBy year:")
print(era1['Year'].value_counts().sort_index())

ERA 1 (2000-2008)
Total English reports: 56

By group:
Group
Consumer      25
Industrial    21
Financial     10
Name: count, dtype: int64

By year:
Year
2002.0     3
2003.0     1
2004.0     1
2005.0     4
2006.0     6
2007.0    21
2008.0    20
Name: count, dtype: int64


In [33]:
# Check non-English in 2009-2012
non_eng_gap = final_data[
    (final_data['english_non_english'] == 'non-english') &
    (final_data['Year'].between(2009, 2012))
].copy()

print(f"Non-English reports 2009-2012: {len(non_eng_gap)}")

Non-English reports 2009-2012: 555


In [34]:
# For Industrial group
industrial_data = analysis_data[analysis_data['Group'] == 'Industrial']
print("INDUSTRIAL GROUP - Reports by Country")
print("=" * 60)
print(industrial_data['Country'].value_counts().head(20))
print("\n")

# For Financial group
financial_data = analysis_data[analysis_data['Group'] == 'Financial']
print("FINANCIAL GROUP - Reports by Country")
print("=" * 60)
print(financial_data['Country'].value_counts().head(20))
print("\n")

# For Consumer group
consumer_data = analysis_data[analysis_data['Group'] == 'Consumer']
print("CONSUMER GROUP - Reports by Country")
print("=" * 60)
print(consumer_data['Country'].value_counts().head(20))

INDUSTRIAL GROUP - Reports by Country
Country
Japan                                                   142
United States of America                                 68
India                                                    43
Canada                                                   19
Germany                                                  17
France                                                   17
United Kingdom of Great Britain and Northern Ireland     17
Korea, Republic of                                       12
Finland                                                  11
Switzerland                                              11
South Africa                                             10
Russian Federation                                        9
Italy                                                     9
Netherlands                                               8
Poland                                                    8
Australia                                             

In [35]:
# See companies with their countries and report counts
company_summary = analysis_data.groupby(['Name', 'Country']).size().reset_index(name='report_count')
company_summary = company_summary.sort_values('report_count', ascending=False)

print("TOP 50 COMPANIES - WITH COUNTRY AND REPORT COUNT")
print("=" * 80)
print(company_summary.head(50).to_string(index=False))

TOP 50 COMPANIES - WITH COUNTRY AND REPORT COUNT
                         Name                                              Country  report_count
                       TERUMO                                                Japan             9
              Infosys Limited                                                India             8
                     NSK Ltd.                                                Japan             7
                   Iino Lines                                                Japan             7
                      Iwatani                                                Japan             7
            Jtekt Corporation                                                Japan             7
               TECHNO ASSOCIE                                                Japan             6
          THK Company Limited                                                Japan             6
                  TNT Express                                          Netherl

In [36]:
# See the 78 English reports from Japan 2009-2012
japan_english_gap = final_data[
    (final_data['Country'] == 'Japan') &
    (final_data['Year'].between(2009, 2012)) &
    (final_data['english_non_english'] == 'english')
].copy()

print(f"Japanese English reports 2009-2012: {len(japan_english_gap)}")
print("\nBy year:")
print(japan_english_gap['Year'].value_counts().sort_index())
print("\nSample companies:")
print(japan_english_gap['Name'].head(10).tolist())

Japanese English reports 2009-2012: 78

By year:
Year
2009.0     6
2010.0    12
2011.0    26
2012.0    34
Name: count, dtype: int64

Sample companies:
['Taisei', 'TDK', 'TERUMO', 'TERUMO', 'TERUMO', 'TERUMO', 'THK Company Limited', 'Toray Industries Inc', 'Toyoda Gosei Company Limited', 'Toyoda Gosei Company Limited']


In [37]:
analysis_data = final_data[
    (final_data['Group'] != 'Other') &
    (final_data['english_non_english'] == 'english')
].copy()
gap_check = analysis_data[analysis_data['Year'].between(2009, 2012)]
print(len(gap_check))

234


In [None]:
# Get the 78 Japanese English reports
japan_78 = final_data[
    (final_data['Country'] == 'Japan') &
    (final_data['Year'].between(2009, 2012)) &
    (final_data['english_non_english'] == 'english')
].copy()

print("Sector names in these 78 reports:")
print(japan_78['Sector'].value_counts())

Sector names in these 78 reports:
Sector
Equipment                 10
Financial Services         6
Chemicals                  6
Automotive                 6
Energy                     6
Technology Hardware        5
Metals Products            5
Healthcare Products        4
Energy Utilities           4
Retailers                  4
Construction               3
Logistics                  3
Other                      3
Railroad                   3
Conglomerates              2
Mining                     2
Real Estate                2
Textiles and Apparel       1
Construction Materials     1
Telecommunications         1
Computers                  1
Name: count, dtype: int64


In [38]:
# English reports only during 2009-2012
english_2009_2012 = final_data[
    (final_data['Year'].between(2009, 2012)) &
    (final_data['english_non_english'] == 'english')
].copy()

print(f"ENGLISH REPORTS 2009-2012: {len(english_2009_2012)}")
print("\nBY YEAR:")
print(english_2009_2012['Year'].value_counts().sort_index())
print("\nBY SECTOR (TOP 10):")
print(english_2009_2012['Sector'].value_counts().head(10))
print("\nBY COUNTRY (TOP 10):")
print(english_2009_2012['Country'].value_counts().head(10))
print("\nBY GROUP:")
if 'Group' in english_2009_2012.columns:
    print(english_2009_2012['Group'].value_counts())

ENGLISH REPORTS 2009-2012: 309

BY YEAR:
Year
2009.0     32
2010.0     54
2011.0    100
2012.0    123
Name: count, dtype: int64

BY SECTOR (TOP 10):
Sector
Financial Services            40
Other                         20
Equipment                     18
Energy                        18
Technology Hardware           14
Food and Beverage Products    14
Healthcare Products           12
Telecommunications            12
Automotive                    12
Chemicals                     11
Name: count, dtype: int64

BY COUNTRY (TOP 10):
Country
Japan                       78
United States of America    50
Canada                      18
Hong Kong                   13
South Africa                12
Australia                   11
India                       11
Netherlands                  9
Finland                      8
Mainland China               8
Name: count, dtype: int64

BY GROUP:
Group
Industrial    99
Other         75
Consumer      68
Financial     67
Name: count, dtype: int64
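
The year labels in the output above print as `2009.0`, `2010.0`, and so on, which suggests the `Year` column is stored as floats (often a side effect of missing values). A minimal sketch of one way to get integer labels, assuming a toy Series in place of `final_data['Year']`; the values are illustrative only:

```python
import pandas as pd

# Toy data standing in for final_data['Year'] (illustrative values only)
years = pd.Series([2009.0, 2010.0, 2010.0, None])

# The nullable integer dtype keeps missing values as <NA> instead of forcing floats
years_int = years.astype("Int64")

# value_counts() drops missing values by default
counts = years_int.value_counts().sort_index()
print(counts)
```

Filters such as `.between(2009, 2012)` behave the same on either dtype, so the conversion only affects how the labels are displayed.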


In [None]:
# First, make sure your group lists match EXACT sector names
industrial = ['Energy', 'Energy Utilities', 'Mining', 'Metals Products',
              'Chemicals', 'Construction', 'Automotive', 'Equipment']

financial = ['Financial Services', 'Real Estate', 'Commercial Services',
             'Logistics', 'Conglomerates']

consumer = ['Food and Beverage Products', 'Retailers', 'Technology Hardware',
            'Telecommunications', 'Healthcare Products', 'Household and Personal Products']

# Create group labels (overwriting any existing 'Group' column)
final_data['Group'] = 'Other'
final_data.loc[final_data['Sector'].isin(industrial), 'Group'] = 'Industrial'
final_data.loc[final_data['Sector'].isin(financial), 'Group'] = 'Financial'
final_data.loc[final_data['Sector'].isin(consumer), 'Group'] = 'Consumer'

# Now create analysis_data with ALL English reports
analysis_data = final_data[
    (final_data['english_non_english'] == 'english')
].copy()

print(f"Total English reports: {len(analysis_data)}")
print("\nBy group:")
print(analysis_data['Group'].value_counts())
print("\nBy year:")
print(analysis_data['Year'].value_counts().sort_index())

Total English reports: 1366

By group:
Group
Industrial    393
Other         360
Financial     322
Consumer      291
Name: count, dtype: int64

By year:
Year
2002.0      4
2003.0      2
2004.0      2
2005.0      6
2006.0      9
2007.0     24
2008.0     24
2009.0     32
2010.0     54
2011.0    100
2012.0    123
2013.0    194
2014.0    233
2015.0    242
2016.0    317
Name: count, dtype: int64
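
Any sector whose name does not exactly match an entry in one of the group lists stays in `Other`, and a sector listed in two groups ends up in whichever group is assigned last. A minimal sketch of a sanity check for both cases, assuming a toy DataFrame in place of `final_data`; the helper name `check_group_lists` is hypothetical:

```python
import pandas as pd

def check_group_lists(sectors, groups):
    """Return (unknown, overlapping) sector names.

    unknown:     names in the group lists that never occur in `sectors`
    overlapping: names that appear in more than one place in the lists
    """
    observed = set(sectors.dropna().unique())
    seen, unknown, overlapping = set(), set(), set()
    for names in groups.values():
        for name in names:
            if name in seen:
                overlapping.add(name)
            seen.add(name)
            if name not in observed:
                unknown.add(name)
    return sorted(unknown), sorted(overlapping)

# Toy data standing in for final_data (illustrative values only)
toy = pd.DataFrame({"Sector": ["Energy", "Mining", "Retailers", "Railroad"]})
toy_groups = {
    "Industrial": ["Energy", "Mining"],
    "Consumer": ["Retailers", "Retailer"],  # 'Retailer' is a deliberate typo
}

unknown, overlapping = check_group_lists(toy["Sector"], toy_groups)
print("Listed but not in data:", unknown)
print("Listed in several groups:", overlapping)
```

With the real data, the same call would take `final_data['Sector']` and a dict built from the `industrial`, `financial`, and `consumer` lists.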


In [None]:
# Filter to 2000-2016, exclude the 'Other' group
# (note: the variable name still says 2013_2016)
analysis_2013_2016 = analysis_data[
    (analysis_data['Year'].between(2000, 2016)) &
    (analysis_data['Group'] != 'Other')
].copy()

print("ENGLISH REPORTS 2000-2016 (EXCLUDING OTHER)")
print("=" * 60)
print(f"Total: {len(analysis_2013_2016)}")
print("\nBy group:")
print(analysis_2013_2016['Group'].value_counts())
print("\nBy year:")
print(analysis_2013_2016['Year'].value_counts().sort_index())

ENGLISH REPORTS 2000-2016 (EXCLUDING OTHER)
Total: 1006

By group:
Group
Industrial    393
Financial     322
Consumer      291
Name: count, dtype: int64

By year:
Year
2002.0      3
2003.0      1
2004.0      1
2005.0      4
2006.0      6
2007.0     21
2008.0     20
2009.0     25
2010.0     43
2011.0     77
2012.0     89
2013.0    148
2014.0    168
2015.0    180
2016.0    220
Name: count, dtype: int64
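
In the cell above, the year bounds appear separately in the variable name, the filter, and the printed header, so they can drift apart. A minimal sketch of a helper that derives the header from the same arguments as the filter, assuming a toy DataFrame in place of `analysis_data`; the function name `filter_reports` is hypothetical:

```python
import pandas as pd

def filter_reports(df, start, end, exclude_groups=("Other",)):
    """Return rows with start <= Year <= end (inclusive), minus the excluded groups.

    Also returns a header string built from the same arguments.
    """
    mask = df["Year"].between(start, end) & ~df["Group"].isin(exclude_groups)
    excluded = ", ".join(exclude_groups).upper()
    label = f"ENGLISH REPORTS {start}-{end} (EXCLUDING {excluded})"
    return df[mask].copy(), label

# Toy data standing in for analysis_data (illustrative values only)
toy = pd.DataFrame({
    "Year": [2012.0, 2013.0, 2014.0, 2016.0, None],
    "Group": ["Industrial", "Other", "Consumer", "Financial", "Consumer"],
})

subset, label = filter_reports(toy, 2013, 2016)
print(label)
print(f"Total: {len(subset)}")
print(subset["Group"].value_counts())
```

Rows with a missing `Year` fail the `between` test and are dropped, which matches the behaviour of the filter in the cell above.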


In [None]:
# For each group, show sectors and countries

groups = ['Industrial', 'Financial', 'Consumer']

for group in groups:
    print(f"\n{'='*60}")
    print(f"{group} GROUP - SECTORS")
    print(f"{'='*60}")

    group_data = analysis_data[analysis_data['Group'] == group]

    # Sectors within this group
    print(group_data['Sector'].value_counts().head(10))

    print(f"\n{group} GROUP - COUNTRIES")
    print(f"{'='*60}")

    # Countries within this group
    print(group_data['Country'].value_counts().head(10))
    print("\n")


Industrial GROUP - SECTORS
Sector
Energy              83
Equipment           68
Chemicals           57
Energy Utilities    43
Automotive          38
Construction        36
Metals Products     35
Mining              33
Name: count, dtype: int64

Industrial GROUP - COUNTRIES
Country
Japan                                                   130
United States of America                                 56
India                                                    34
Germany                                                  16
Canada                                                   15
France                                                   14
United Kingdom of Great Britain and Northern Ireland     14
Switzerland                                              10
Russian Federation                                        9
Korea, Republic of                                        8
Name: count, dtype: int64



Financial GROUP - SECTORS
Sector
Financial Services     153
Real Estate             56
L

In [None]:
# Create company summary table
company_summary = analysis_data.groupby(
    ['Name', 'Sector', 'Country', 'Group']
).size().reset_index(name='report_count')

# Sort by report count (highest first)
company_summary = company_summary.sort_values('report_count', ascending=False)

# Display top 50 companies
print("=" * 100)
print("TOP 50 COMPANIES - WITH SECTOR, COUNTRY, GROUP, AND REPORT COUNT")
print("=" * 100)
print(company_summary.head(50).to_string(index=False))

# Optional: Save to CSV
company_summary.to_csv('company_summary.csv', index=False)
print("\n✅ Full company summary saved to 'company_summary.csv'")

TOP 50 COMPANIES - WITH SECTOR, COUNTRY, GROUP, AND REPORT COUNT
                         Name                          Sector                  Country      Group  report_count
                       TERUMO             Healthcare Products                    Japan   Consumer             9
                      Iwatani                Energy Utilities                    Japan Industrial             7
              Infosys Limited             Commercial Services                    India  Financial             7
            Jtekt Corporation                 Metals Products                    Japan Industrial             7
                     NSK Ltd.                       Equipment                    Japan Industrial             7
                  TNT Express                       Logistics              Netherlands  Financial             6
 Toyoda Gosei Company Limited                      Automotive                    Japan Industrial             6
                   Iino Lines          

In [None]:
# Count reports per company
company_freq = analysis_data['Name'].value_counts()

print("REPORTS PER COMPANY DISTRIBUTION")
print("=" * 50)
print(f"Companies with 1 report:  {(company_freq == 1).sum()}")
print(f"Companies with 2 reports: {(company_freq == 2).sum()}")
print(f"Companies with 3 reports: {(company_freq == 3).sum()}")
print(f"Companies with 4+ reports: {(company_freq >= 4).sum()}")

REPORTS PER COMPANY DISTRIBUTION
Companies with 1 report:  580
Companies with 2 reports: 168
Companies with 3 reports: 64
Companies with 4+ reports: 52
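
The four buckets above can also be computed in one pass by capping the per-company counts at 4 and counting the capped values. A minimal sketch, assuming a toy Series in place of `analysis_data['Name']`; the values are illustrative only:

```python
import pandas as pd

# Toy data standing in for analysis_data['Name'] (illustrative values only)
names = pd.Series(["A", "A", "A", "A", "A", "B", "B", "C", "D", "D", "D", "E"])

company_freq = names.value_counts()        # reports per company
bucketed = company_freq.clip(upper=4)      # collapse 4, 5, 6, ... into one bucket
distribution = bucketed.value_counts().sort_index()

# Relabel the capped bucket so it reads "4+"
distribution.index = [f"{n}+" if n == 4 else str(n) for n in distribution.index]
print(distribution)
```

Changing the cap in `clip(upper=...)` and in the relabelling step moves the open-ended bucket without adding more `print` lines.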
